About two months ago, I added the ability to search blog postings within Jäger. You first saw this in the 1.3 beta releases. As I thought about this more and more, I realized there was something much more general purpose here: why not search anything with Jäger? I quickly implemented an Amazon interface using the pyamazon module and was quite pleased with the results. Each Amazon category (books, dvd, video, and so forth) was treated as a blog and each entry was a particular search result. I decided not to put this in the official Jäger 1.4 release because I thought I had the beginning of something much more powerful here that needed to be done correctly.
Note that I'm talking about something orthogonal to what JWZ is mentioning here. I'm not talking about using RSS to return changing results from a persistent search, though I think that's a great idea and can be easily implemented with the libraries I'm releasing to you. I'm talking about using RSS to return search results, i.e. something entirely ephemeral in nature; you look at the results, then discard them.
To do this, I created something I'm calling the Universal Search Interface. It's a Python library for searching ... anything. It's built on top of a very powerful and easy-to-use scraping library called "Drücken", which lets me scrape ... well, almost anything with regular output. It doesn't have to use scrapers, though: it uses the pyamazon and pytechnorati libraries for accessing Amazon and Technorati.
But enough babbling from me. Here's the basic code (from a user's perspective) to do a search (each result is a single search element):
import pprint
import Search

for result in Search.search(text = 'something'):
    pprint.pprint(result)

Here are a few example search strings:
- search:Amazon buffy the vampire slayer type:dvd
- search:Google type:images "dan rather"
- J Janes search:Canada411 state:NL city:"St. John's" pages:2
- Dan Smith search:Whitepages state:NY city:"New York" pages:all
The only constant element here is the 'search:'. Every element with a colon in it is a 'restriction'. The restrictions that the USI directly recognizes are 'search:', 'type:' and 'pages:'. 'search:' lets the USI locate the searching class; 'type:' narrows the search to a particular sub-service of a search engine; and 'pages:' sets the maximum number of pages of search results that can be retrieved from the particular search service. The default for 'pages:' is '1'; obviously the meaning of a page is highly dependent on the search engine being used. Two more restrictions, 'language:' and 'template:', may be added later.
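To make the restriction idea concrete, here's a minimal sketch of how a search string could be split into free text and restrictions. This is not the USI's actual parser: the `parse_query` name, the regular expression and the returned shape are all assumptions made up for illustration.

```python
import re

# A token is either key:value (the value may be double-quoted)
# or a plain word / double-quoted phrase.
TOKEN = re.compile(r'(\w+):("[^"]*"|\S+)|("[^"]*"|\S+)')

def parse_query(query):
    """Hypothetical helper: return (free_text, restrictions) for a search string."""
    restrictions = {'pages': '1'}      # the default page limit is '1'
    words = []
    for key, value, word in TOKEN.findall(query):
        if key:
            # Anything with a colon in it is treated as a restriction.
            restrictions[key] = value.strip('"')
        else:
            words.append(word)
    if 'search' not in restrictions:
        raise ValueError("a 'search:' restriction is required")
    return ' '.join(words), restrictions

text, restrictions = parse_query(
    'Dan Smith search:Whitepages state:NY city:"New York" pages:all')
print(text)          # the free-text part of the query
print(restrictions)  # search, state, city and pages restrictions
```

Restrictions the sketch doesn't know about (like 'state:' and 'city:') are simply passed along, which matches the idea that only 'search:', 'type:' and 'pages:' are interpreted by the USI itself.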
Note also that the search interface is implemented as an iterator (using generators, actually). Thus search results must be retrieved in order, starting at the very first result! Also note that searches like 'search:Google dog' may potentially retrieve hundreds of thousands of results, which is very nasty. However, results are returned as soon as they're available, which is not only handy, but essential.
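Here's a small sketch of why the generator approach matters. The `fake_fetch_page` function is a hypothetical stand-in for one round trip to a search service, and `search` here is a toy, not the real USI function; the point is only that pages are fetched lazily and a caller can stop early.

```python
import itertools

fetched_pages = []   # records which pages were actually requested

def fake_fetch_page(page):
    """Hypothetical stand-in for one network round trip to a search service."""
    fetched_pages.append(page)
    return [{'_title': u'result %d-%d' % (page, i)} for i in range(3)]

def search(max_pages=1):
    """Yield results one at a time, fetching each page only when it is needed."""
    for page in range(1, max_pages + 1):
        for result in fake_fetch_page(page):
            yield result

# Take only the first four results, even though up to three pages are allowed.
first_four = list(itertools.islice(search(max_pages=3), 4))
for result in first_four:
    print(result['_title'])
print(fetched_pages)   # pages that were really fetched
```

Because the caller stops after four results, the third page is never requested. That's what keeps a search like 'search:Google dog' from being a disaster.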
Here's some example output from the USI (for the Canada 411 search):
{'Address': u'39 Goldeneye Pl',
'City': u"St John's",
'Country': 'CA',
'FirstName': u'J',
'LastName': u'Janes',
'Name': u'J Janes',
'Phone': u'(709) 747-0979',
'PhoneURI': u'tel:+1-709-747-0979',
'State': u'NL',
'_link': u'http://findaperson.canada411.ca/more_info/...',
'_title': u'Janes, J'}
{'Address': u'8 Lynch Pl',
'City': u"St. John's",
'Country': 'CA',
'FirstName': u'J',
'LastName': u'Janes',
'Name': u'J Janes',
'Phone': u'(709) 722-8327',
'PhoneURI': u'tel:+1-709-722-8327',
'PostalCode': u'A1B 4L8',
'State': u'NL',
'_link': u'http://findaperson.canada411.ca/more_info/...',
'_title': u'Janes, J'}
...
The rules for the output format are quite simple:
- the only valid values in a result are Unicode strings, integers, floats, lists and dictionaries, with the latter two being discouraged but not prohibited. Non-Unicode strings and classes or "bags" are not allowed.
- names starting with an underscore are reserved. The reserved names currently in use are '_title', '_link', '_html' and '_text'.
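The two rules above can be sketched as a small checker. This is only an illustration, not part of the USI: `is_valid_value` and `is_valid_result` are hypothetical names, and the code assumes Python 3, where `str` is the Unicode string type and `bytes` plays the role of a non-Unicode string. Requiring string keys inside nested dictionaries is also my assumption; the rules don't say.

```python
RESERVED = {'_title', '_link', '_html', '_text'}

def is_valid_value(value):
    """Check a value against the allowed types (assumes Python 3 str == Unicode)."""
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, list):          # allowed, but discouraged
        return all(is_valid_value(v) for v in value)
    if isinstance(value, dict):          # allowed, but discouraged
        return all(isinstance(k, str) and is_valid_value(v)
                   for k, v in value.items())
    return False                         # bytes, class instances, None, ...

def is_valid_result(result):
    """Check one search result dictionary against both rules."""
    if not isinstance(result, dict):
        return False
    for name, value in result.items():
        if not isinstance(name, str):
            return False
        # Underscore names are reserved; only the known ones may appear.
        if name.startswith('_') and name not in RESERVED:
            return False
        if not is_valid_value(value):
            return False
    return True

print(is_valid_result({'Name': u'J Janes', '_title': u'Janes, J'}))
print(is_valid_result({'Name': b'J Janes'}))
print(is_valid_result({'_score': 1}))
```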
The source code for this (and Jäger's "generic" library, which this depends on) will be released next week under the standard Python source code license. The rest of Jäger will be released the week after under a different license, the details of which I'm still working on.
So, what does this have to do with RSS search results?
Well, there's another layer coming called the "Pylot interface". Pylots are little Python webservices that you can plug into a Pylot Engine. Jäger will be one of these, though there's no reason the engine can't be a different freestanding application. The idea I have is that there'll always be a Python environment running on your desktop (which is what Jäger is) that gives you a local webserver (maybe using twisted), a database such as MySQL if it's available, full access to the wxPython library and so forth. You want to do something? Just drop a piece of Pylot code in the correct directory and it's executing like a Windows application!
One possible idea for a Pylot is a front end to the Universal Search Interface that can return HTML, RSS 2 or even RDF results. Because it's on your own desktop and serving only 127.0.0.1, there are no worries about the various terms of use that a public webserver would have to deal with. If the Pylot environment has MySQL, it's easy to implement JWZ's search engine result interface.
On the subject of RDF, perhaps one of you folks has some suggestions about how I could best return RDF results? It seems to me that this would be great for you semantic web types and could bootstrap your projects quite a bit. Does each USI class need to return a dictionary of what the terms mean, or can I just make up vocabularies ad-hoc?
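As a rough idea of what the RSS 2 side of such a front end might look like, here's a sketch that turns a list of result dictionaries into an RSS 2.0 document using the standard library's ElementTree. The `results_to_rss` name, the channel text and the mapping of '_title' and '_link' onto RSS item elements are all assumptions for illustration; no Pylot code exists yet.

```python
import xml.etree.ElementTree as ET

def results_to_rss(query, results):
    """Hypothetical helper: render search results as an RSS 2.0 string."""
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    # RSS 2.0 requires a title, link and description on the channel.
    ET.SubElement(channel, 'title').text = 'Search results for: ' + query
    ET.SubElement(channel, 'link').text = 'http://127.0.0.1/'
    ET.SubElement(channel, 'description').text = 'Ephemeral search results'
    for result in results:
        item = ET.SubElement(channel, 'item')
        # Assumed mapping: '_title' -> <title>, '_link' -> <link>.
        ET.SubElement(item, 'title').text = result.get('_title', '')
        if '_link' in result:
            ET.SubElement(item, 'link').text = result['_link']
    return ET.tostring(rss, encoding='unicode')

sample = [{'_title': u'Janes, J', '_link': u'http://example.com/1', 'State': u'NL'}]
print(results_to_rss('J Janes state:NL', sample))
```

Fields other than the reserved ones (like 'State' above) are dropped in this sketch; carrying them along is exactly where the RDF question gets interesting.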
Anyway, I'm way ahead of myself now. You'll see the USI on Wednesday.

