Choosing an Open Source Desktop Search Tool: Part 1

by Rich on February 20, 2010

I have a few projects cooking that rely on full-content search.  A lot of work has gone into a number of tools, and the field has matured to the point where there’s no clear-cut leader.  A number of people have compared a few of them, but I haven’t found anything both comprehensive and recent.  This is the first of 4 parts detailing my investigation of what seem to me to be the leading tools out there.

Search is a pretty huge universe, and I’m really focused on a small part of it — I need a tool which works in a desktop environment and which is embeddable in a server/system.  I have no interest in trying to out-crawl Google or index the universe.  But I have a lot of files, and, all too often, I’ll discover I have 3-4 copies of something stashed in different places because I couldn’t find what I was looking for.  Finding what I already have is the problem I’m trying to solve.

I want to use open source because I want to be able to tweak and play, and because, depending on how my projects proceed, I may want to make it part of a larger tool set.  So it has to be some sort of FOSS-licensed tool.  I have no dog in the hunt over which license it is.

From what I’ve been able to find, there are 4 leading tools out there which fit both of the above criteria:  Beagle, Tracker, Recoll, and Strigi.  There are a lot of similarities, of course, since they’re all search tools, but they’re different enough to make an evaluation meaningful.  Update:  I’ve added Pinot to the list of tools to test.

My interest is primarily in managing the indexing/searching, interfacing with the tool, and getting good results.  I’m not so interested in the GUI any particular tool provides.  There are a lot of posts out there that show you screenshot upon screenshot.  This is not one of those.  I’m looking at installation/configuration, features, interface capability, and performance (specifically with respect to quality of results, possibly with respect to speed).  All of the work I’ll be doing with these tools will be in a pretty minimal environment:  Ubuntu 9.10 JEOS with current updates applied, but not much else done to it, other than SSH and Samba added for my convenience.

A quick overview:

Beagle is Mono based and is one of what appear to be the two more popular tools.  It uses a C# port of Apache Lucene as its search indexer and is typically used through its GTK front end, making it more common in GNOME-based environments.  It uses D-Bus for client/server interaction.

A GNOME desktop project, Tracker, also called MetaTracker, is a C-based tool and is the other one sharing the popularity lead with Beagle.  It also uses D-Bus for IPC, and stores its search metadata in a SQLite database.

A Qt project, Recoll is the only tool I found using the Xapian search engine library.  I don’t know if that’s good or bad, but it’s at least different.  Curiously, Recoll can use the Beagle browser front end, a JavaScript client which pushes searches to a queue which Recoll can service if Beagle isn’t running.

Part of KDE’s semantic desktop, Strigi uses CLucene, a C++ port of Lucene, as its search index by default, but also supports Hyper Estraier and has SQLite and Xapian back ends under development.  CLucene is reported to be the preferred and fastest back end, but Strigi’s pluggable nature is pretty interesting.

In preparation for testing, I wanted to see what it would take to get the tools installed on my systems.  Fairly recent versions of each tool were available in Ubuntu’s repositories, and a quick check revealed my first interesting metric, the number of packages apt wanted to install:

Beagle: 208
Tracker: 201
Recoll: 37
Strigi: 34
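For the curious, a simulated install is enough to reproduce counts like these without touching the system.  A minimal sketch, assuming the Ubuntu 9.10 package names (and the helper name is mine):

```shell
# Sketch: count the packages apt would pull in, without installing anything.
# apt-get's --simulate (-s) mode prints one "Inst <pkg> ..." line per
# package it would install, so counting those lines gives the totals above.
count_new_packages() {
    apt-get install --simulate "$1" | grep -c '^Inst'
}

# Usage (on the Ubuntu box itself):
#   count_new_packages beagle
#   count_new_packages recoll
```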

It’s a very stark difference.  I haven’t yet dug into what’s behind the huge numbers for Beagle and Tracker.  I suspect that, since they’re aimed squarely at desktop users, their Ubuntu packages carry dependencies that pull in a full end-user desktop environment.  That’s not something I want on my JEOS server, but I’m even less interested in building and maintaining my own packages.  I’ll see if I can pare things down a bit before the actual install.
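One place I plan to look first: Ubuntu installs soft “Recommends” by default, which is a likely source of the desktop baggage.  A sketch of how I might check that (helper names are mine; whether it helps depends on how much of the pull-in is Recommends versus hard Depends):

```shell
# Sketch: see whether a package's bloat comes from hard Depends or soft
# Recommends.  "apt-cache depends" prints one "Depends:"/"Recommends:"
# line per relationship, so counting the Recommends lines is a quick read.
recommends_count() {
    apt-cache depends "$1" | grep -c 'Recommends:'
}

# If Recommends dominate, a simulated install with them disabled should
# show a much smaller pull-in:
lean_count() {
    apt-get install --simulate --no-install-recommends "$1" | grep -c '^Inst'
}

# Usage (on the Ubuntu box): recommends_count beagle; lean_count beagle
```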

Stay tuned for Part 2, which will lay out my testing plans and move into working with each of the tools.
