Choosing an Open Source Desktop Search Tool: Part 3

by Rich on February 28, 2010

My search testing continues in this post with details on setting up and using tracker, recoll, and strigi.  My overall intent and plans are laid out in Part 1.  Testing environment details and my work with beagle appear in Part 2.

Tracker setup and searching

As with beagle, tracker was installed using apt-get install tracker.  Apt had a hefty package count here as well — 201 for tracker vs. 208 for beagle.  These fell into only two general buckets, though:  tracker and its related libraries/parsers, and X/gnome.  Tracker is a C-based tool, so there was no need for all of the Mono-related additions, which makes tracker’s effective package count a bit higher than beagle’s.  It had a comparable set of tracker- and parser-related packages as well as a substantial amount of X-related infrastructure.  For my purposes, that’s totally unnecessary — I have no plans to use the GUI — but if the GUI is assumed to be part of the tool, it’s understandable.  Tracker does have provisions for non-GUI access, though it’s somewhat less friendly than beagle’s (more below).

I ran into some early problems.  After completing the initial tracker installation, I couldn’t get trackerd to start; it complained that hald (the Hardware Abstraction Layer daemon) was unavailable.  It turns out my JEOS environment didn’t have hald installed.  That was fixed with apt-get install hal and service hal start, but it was annoying while I was trying to determine whether the problem was environment settings or packages.  This is clearly not a fault of tracker per se, but rather of tracker’s packaging in Ubuntu and its undeclared dependencies.

Trackerd is explicitly and entirely dependent on dbus.  When run within a gnome session, a dbus session is readily available.  In my command line only environment, there was no dbus session instance waiting for me.  The obvious solution is to use dbus-launch to kick off trackerd.  For my purposes, though, that wasn’t ideal.  When launching dbus from inside a shell, the dbus session is a child of that shell, and only the shell’s children have access to the session.  I needed both trackerd and the search tools to have access to the same dbus session instance, so I used dbus-launch bash to give myself a new shell from which everything would have access to the dbus session instance I was using for tracker.  Startup, then, was dbus-launch bash, and from that shell, trackerd &.
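The startup sequence can be sketched as below.  The commented lines are the commands just described (they need dbus and tracker installed, so they aren’t run here); the last two lines just illustrate the environment inheritance that makes the trick work, using a made-up bus address.

```shell
# Startup as described above (requires dbus and tracker; not run here):
#
#   dbus-launch bash      # new shell owning a fresh dbus session instance
#   trackerd &            # indexer runs as a child of that shell
#   tracker-search foo    # client finds the same session via the environment
#
# The mechanism is ordinary environment inheritance: dbus-launch exports
# DBUS_SESSION_BUS_ADDRESS, and every child of the new shell sees it.
# Simulated here with a stand-in value:
export DBUS_SESSION_BUS_ADDRESS="unix:path=/tmp/example-bus"  # stand-in
sh -c 'echo "child sees: $DBUS_SESSION_BUS_ADDRESS"'
```

Any process started outside that shell gets no address and so can’t reach the session — which is exactly the limitation discussed below.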

That got trackerd running and searchable, but left me without any indexing.  Verbose mode (trackerd -v 2 &) told me that trackerd was running in read-only mode.  This hunt was a little more annoying, as the read-only mode debugging message doesn’t match the language used in the config file or documentation.  Tracker.cfg has a number of directives which enable content indexing, several of which were disabled.  I had to go in and turn on EnableWatching, EnableIndexing and EnableContentIndexing.  It’s possible I could have gotten by with fewer of these, but when the config file comments tell me these ‘disable all indexing’, I opted for the shotgun approach.  This, again, is not a limitation of tracker, but reflects Ubuntu’s packaging decisions.  I don’t know that it’s even a serious concern, as any desktop configuration of tracker would surely fix these values; in most cases it would be transparent to the user.  I expect it’s more a result of my focus on the command line running contrary to Ubuntu’s focus on the desktop.

Directories to crawl, and directories to monitor with inotify, can be configured in the same tracker.cfg with the CrawlDirectory and WatchDirectoryRoots options respectively.  As with beagle, tracker monitors the user’s home directory by default.  Unlike beagle, which adds a config entry per directory, tracker lists multiple directories on a single line.
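Put together, the tracker.cfg edits look roughly like this.  The key names are the ones from my install as mentioned above; the section placement, list separator, and directory paths are my best guess and examples, so check them against your own file:

```ini
; tracker.cfg (excerpt; sections and paths are illustrative)
[Watches]
; Roots monitored via inotify -- multiple directories on one line
WatchDirectoryRoots=/home/rich;/srv/docs;
; Directories crawled once rather than monitored
CrawlDirectory=/mnt/archive;
EnableWatching=true

[Indexing]
EnableIndexing=true
EnableContentIndexing=true
```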

Tracker search syntax is different from beagle’s.  Where beagle is google-like, tracker uses RDF/SPARQL syntax.  In principle, this is very powerful, as SPARQL is flexible and SQL-like.  As a practical matter, it’s probably less accessible to typical users.  There are two command line search tools:  tracker-search and tracker-query.  Tracker-search takes simple string arguments and ANDs them together.  Tracker-query runs the query in a SPARQL file the searcher must provide.  I wanted to keep things simple, so I limited my searches to tracker-search.
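For a concrete sense of the call shapes: tracker-search just ANDs its arguments, while tracker-query takes a query file.  The binaries are stubbed below, since the real ones need trackerd on the bus; the search terms and file name are purely illustrative.

```shell
# Stubs standing in for the real clients (which require a running trackerd):
tracker-search() { echo "AND-search for: $*"; }
tracker-query()  { echo "running SPARQL from file: $1"; }

tracker-search budget report     # all terms must match
tracker-query myquery.sparql     # the SPARQL text lives in the file
```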

Tracker did some things very well.  Document searches did everything beagle did, plus it successfully indexed MS Word 2003 documents.  It was not successful with Word 2007 format documents.  For spreadsheets, it was likewise only able to provide results for the OpenOffice spreadsheet, returning no results for the xls or csv files.  Audio file tag searches were effective for mp3 and ogg, but not for m4a.  Filename searches were solid for any files directly visible to tracker, but no results were provided for files contained in an archive of any kind.

The time to search was drastically different from beagle’s.  Tracker came in at 0.110s user, 0.140s system, 0.298s total.  It can’t possibly be a linear, completely scalable relationship, but a difference of roughly 40x is very notable.

On the whole, tracker provided less complete results than beagle, though the addition of MS Word 2003 files was an extremely important benefit.  It is considerably more temperamental when moved outside of its ‘standard’ desktop environment.  The need to have access to the same dbus session significantly complicates and limits how tracker can be used outside of an X desktop.  With beagle, I could search from any instance of any shell on the system where beagled was running.  With tracker, I had to run trackerd and tracker-search as children of the same shell in order to keep them connected to the same dbus session instance.  Were I using tracker as a back end for a search service, this wouldn’t necessarily be an issue, as dbus-launch could load the service, which could in turn simply launch trackerd into the background.  Still, it made trackerd troublesome, harder to debug, and likely less robust.  On more than one occasion, trackerd failed to de-register itself from dbus as it shut down; on restarting tracker, the service would be blocked because the name was already in use.  This was intermittent, but it did require me to kill my shell and start over from dbus-launch bash.

Recoll setup and searching

Recoll setup followed the same form as beagle and tracker:  apt-get install recoll did the trick.  Departing from beagle and tracker, though, was recoll’s package count.  It needed a mere 37 packages.  It still had the same primary categories as tracker — recoll tools and search-related libraries, plus X related cruft.  Recoll’s GUI is Qt based, which differs from beagle and tracker, and the amount of X related extras is clearly considerably lower.

Recoll stores all of its indexes using Xapian, a C++ based open source search engine library.  It’s a highly portable, lightweight engine which offers pretty sophisticated search capabilities and a very generous set of bindings for most mainstream languages — better than tracker or beagle.

Getting recoll running was very easy once it was installed.  Recollindex will index the user’s home directory by default.  Alternate/additional directories can be configured in recoll.conf, as can a variety of indexing options, including alternate or additional languages.  This seems to apply to xapian’s weighting of important words and what constitutes an important word in a particular language (though I didn’t test to sort out exactly what was happening here).
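A minimal recoll.conf sketch, assuming the stock layout — topdirs is recoll’s standard directive for naming the indexed roots, and indexstemminglanguages covers the language handling mentioned above; the directories and language list here are just examples:

```ini
# recoll.conf (excerpt; directories and languages are examples)
# Space-separated list of roots to index (default is the home directory)
topdirs = ~/ /srv/docs
# Stemming / important-word handling per language
indexstemminglanguages = english french
```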

Just launching recollindex produces a very different result than kicking off beagled or trackerd.  Recollindex, by default, will index the configured directories, store that info in the database, and exit.  It does not load as a daemon by default.  Using recollindex -m will launch it in daemon mode, where it does the same filesystem monitoring (depending on your configuration) as the other tools.

The index-then-exit behavior highlights another interesting difference in recoll.  The search client accesses the index database directly, making it a first-class database client rather than a client of the indexing daemon.  So if you have a relatively static system, a system with limited resources, or a system which only needs its index updated periodically, you can leave recollindex off most of the time, saving the resources.  Search remains available on demand, as the recoll search client simply reads the xapian data store directly.
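That opens up a lightweight deployment pattern: skip the daemon entirely and refresh the index on a schedule.  A crontab sketch (the time and binary path are illustrative):

```
# Nightly re-index at 03:00 instead of running 'recollindex -m' as a daemon;
# the recoll search client keeps reading the Xapian store directly in between.
0 3 * * * /usr/bin/recollindex
```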

Recoll uses the same client binary, recoll, for both its Qt GUI and command line searches.  Documentation indicates that the preferred command line client is nominally recollq, but this was not included in the standard Ubuntu packages.  Again, shame on Ubuntu rather than recoll, but it meant recollq was unavailable for my tests.  This wasn’t an issue, however.  Run bare, recoll tries to connect to an X server, but given the -t option it behaves as a command line tool.
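The call shape for text-mode searching looks like this.  The -t flag is the real option described above; the -q query flag is from my reading of recoll’s manual (check your version), and the binary is stubbed since the real one needs an index to read:

```shell
# Stub standing in for the real recoll binary (which reads the Xapian index):
recoll() { echo "text-mode query: $*"; }
recoll -t -q 'piano flac'    # illustrative query terms
```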

Recoll’s query structure defaults to wasabi, a project intended to standardize desktop search.  Wasabi’s structure is based on input from the four tools I’m evaluating.  As a practical matter, recoll queries were effective when structured similarly to beagle queries, using a google-like format.

In its standard installation, recoll’s results were quite anemic.  It failed to index any MS Word, OpenOffice or PDF files, and it failed to parse any audio file tags.  Search hits were only returned for text, html and csv file contents.  All filename searches worked well.  Within those limitations, though, recollindex was quite clear about what it was indexing, where it was failing, and which missing pieces limited its capabilities.  A quick review of the recollindex errors showed that I needed to add antiword, catdoc, flac, id3v2, libid3-dev, poppler-utils, unrtf, unzip, vorbis-tools, and xsltproc.  Using apt-get to install these and their dependencies added another 27 packages, bringing the total to 64.

The addition of helpers changed recoll’s results dramatically.  It was now able to index all document formats – the only tool so far to handle docx, xls, and xlsx.  This also made recoll the only tool so far to handle all three types of spreadsheets.  With xsltproc available, recoll also became the first tested tool to pull information out of the svg file.  It had no trouble with mp3, ogg, or flac tags, but was unable to use any of the m4a tags.  Disappointingly, even with the addition of unzip, recoll was unable to provide any results from inside any archive format.

Search time was quite speedy, at 0.32s user, 0.31s system, and 0.639s total.  This is in the same league as tracker and considerably faster than beagle.

Strigi setup and searching

Strigi’s setup differed from the other three tools.  Ubuntu separates the daemon and client utilities into two separate packages, so apt-get install strigi-daemon strigi-utils was required.  Apt added 34 packages, putting it in the same ballpark as recoll’s initial installation (and with similar classes of packages — search tools and parsing libraries predominate).

Strigi uses CLucene as its storage/search back end by default.  Lucene, an Apache project, is a Java based full-text search engine; CLucene is a C++ port of Lucene, eliminating the need for Java.  Interestingly, strigi has a pluggable storage back end.  Currently supported are CLucene and hyperestraier, with sqlite3 and xapian in the works.

Strigi is built by default to use dbus, but does not require it.  Dbus can be disabled at compile time or in strigi’s daemon.conf.  In either case, strigi allows you to access the daemon via a socket rather than dbus.

Strigi performed very well with all of the document formats.  Hits were found even in the more troublesome doc and docx formats, as well as in all of the structured/spreadsheet files.  The svg file was treated as plain XML, and good results were returned on that as well.  Most interestingly, strigi indexed the svg file’s contents from inside a number of archives — tgz, tar.bz2, and zip archives all had their contents properly indexed.  It wasn’t able to handle the 7z archive, but nothing has handled that.  Strigi gave no hits based on audio file tags.  In its defense, it doesn’t claim to be able to index these files.  Still, it’s a significant omission – reading audio file tags is hardly cutting edge technology.  Strigi was also unable to provide any hits by filename, which was baffling.  Paring the results down by extension/type was similarly disappointing.  That could be a limitation of my ability, but the other tools made it quite easy and intuitive.

Strigi queries returned extremely verbose results.  Where most tools returned a set of file URIs and possibly a hit count, strigi provided the matched file, mime type, size, time of last modification, and a “fragment” of the file, which in some cases turned out to be the entire contents of the file containing the search string.  It also provided a lot of information on the ontologies (the data and file relationship characteristics used to categorize information in the search engine) of the associated files.  There are some potentially interesting possibilities in this behavior (such as search highlighting based strictly on the returned results rather than on a subsequent file retrieval), but it also means a lot more information must be handled in order to do anything with the results.  Just reading strigi results was substantially more complicated than with any of the other tools.

Strigi was pretty fast, with the same test searches running in 1.02s user, 0.55s system, 1.571s total.  This was a bit slower than recoll or tracker, but still substantially quicker than beagle.  I used a similar search script for each tool, adjusting for local syntax.  For beagle, tracker, and recoll, my results were in the 85-120 line range.  Strigi, for the same searches, gave me more than 3700 lines.  So, a lot more data was thrown into the results.  That may or may not be valuable, but strigi clearly did a lot in that time.
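For context on how those line counts were gathered: each tool got a small wrapper script running the same term list and counting output lines.  A stubbed sketch of that harness — the term list and stub output are invented, and the real runs substituted each tool’s actual client:

```shell
# Harness sketch: run a fixed term list through the tool's search client
# and count result lines.  'client' is a stub standing in for
# tracker-search, recoll -t, strigi's client, etc.
client() { printf 'hit-%d\n' 1 2 3; }   # stub returning 3 fake hits per term
for term in budget piano svg; do        # illustrative search terms
    client "$term"
done | wc -l                            # 3 terms x 3 stub hits = 9 lines
```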

I’m changing plans slightly for Part 4.  My research about strigi turned up references to a number of other desktop search tools.  Of those, only one was a real project still being maintained.  So, I’m going to dig into Pinot a bit.  If it turns out to do anything, I’ll add it to the mix in Part 4.
