Choosing an Open Source Desktop Search Tool: Part 2

by Rich on February 23, 2010

This is a continuation of my work to sort out the most useful desktop search tool. You can read about the background and motivation in Part 1. In this post, I’ll detail my test plans and work through setting up the tools themselves.

My test platform is minimal and simple: Ubuntu 9.10 JEOS. I’ve updated it as of Feb 22, 2010, and added on a handful of packages: sshd, smb, some dev tools, sqlite, etc. What I don’t have is any kind of a desktop — everything I’m doing, I’ll be doing from scripts and command line. That’s because my goal is to find more of a back-end tool than to worry about mucking around with front ends.

I’ve built a small list of files to act as my ‘set of searchable stuff’. It’s built to give me enough available data that there are some real results and to validate the ability to index important file types.

[jeos1:~/search] ls -R
.:
archives  audio  docs  pics  video

./archives:
doublearchive.tar  pics.7z  pics.tar  pics.tar.bz2  pics.tgz  pics.zip

./audio:
01 Its On The Rocks.m4a
07 - Whiti Zombi - More Human Than Human.ogg
10 Make It Alright.mp3
17. Abba - Lay All Your Love On Me.flac
Spook Country (Unabridged), Part 1.aa

./docs:
avr.csv  avr.xls   gba.doc   gba.html  gba.pdf  gba.txt
avr.ods  avr.xlsx  gba.docx  gba.odt   gba.rtf

./pics:
biathlon.bmp  biathlon.jpg  biathlon.tga   gradient.svg
biathlon.gif  biathlon.png  biathlon.tiff

./video:
aaf-lunch.monkeys.s01e01-sample.avi  JaneAndTheDragon_S01E01_TestsAndJests.mpg
caprica.s01e01.pilot.sample.mp4      preview.wmv
Episode 133_ Capistrano Tasks.mov

File Details:

The archives are made up of the contents of the pics directory.  doublearchive.tar is a tar containing the other archives in the archives folder.
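For anyone who wants to build a similar archive set, here is a rough sketch in Python.  It is not necessarily how my own files were produced, and the function name build_archives is just something made up for illustration.  It skips the .7z file, since Python’s standard library has no 7z writer.

```python
import tarfile
import zipfile
from pathlib import Path


def build_archives(pics_dir, out_dir):
    """Pack pics_dir in several formats, then wrap those archives in one tar.

    Hypothetical helper: out_dir should live outside pics_dir.
    Returns the path of the outer doublearchive.tar.
    """
    pics_dir, out_dir = Path(pics_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    made = []

    # tarfile modes: "w" is a plain tar, "w:gz" adds gzip, "w:bz2" adds bzip2.
    for name, mode in [("pics.tar", "w"),
                       ("pics.tgz", "w:gz"),
                       ("pics.tar.bz2", "w:bz2")]:
        target = out_dir / name
        with tarfile.open(str(target), mode) as tar:
            tar.add(str(pics_dir), arcname=pics_dir.name)
        made.append(target)

    # The zip holds the same files under the same top-level folder name.
    zip_target = out_dir / "pics.zip"
    with zipfile.ZipFile(str(zip_target), "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(pics_dir.iterdir()):
            if f.is_file():
                zf.write(str(f), arcname=pics_dir.name + "/" + f.name)
    made.append(zip_target)

    # The "archive of archives": a plain tar containing the files made above.
    double = out_dir / "doublearchive.tar"
    with tarfile.open(str(double), "w") as tar:
        for archive in made:
            tar.add(str(archive), arcname=archive.name)
    return double
```

Calling build_archives("pics", "archives") from the search directory would give a comparable layout, minus pics.7z.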

Audio files are all tagged and non-DRM, with the exception of the .aa file.  This is an Audible.com file which uses aac encoding wrapped in Audible’s DRM.

There are lots of docs, but only two sets of text.  For all of the document formats (doc, html, pdf, txt, docx, odt, rtf), I used the text of the Gettysburg Address, threw it into OpenOffice Writer, and exported various files.  The spreadsheets (csv, xls, ods, xlsx) are data sheet summaries of AVRTiny microcontrollers, exported from OpenOffice Calc to various formats.

I was watching biathlon at the Olympics while working on this, so I thought a biathlon image was apt.  The .svg is a sample oval of a color gradient I found on an SVG informational site.

Video files are previews of various shows or games I was able to find with the appropriate file formats, with the exception of the .mov, which is a podcast I happened to have on hand (BTW, I definitely recommend Railscasts if that’s your thing.)

My test system is running on VMWare ESXi, which makes it really easy to keep things clean and consistent.  I took a snapshot prior to installing any search tools, which allows me to roll back to a clean system and put each alternative on a clean box.

Installation is a very simple process.  Using a clean system for each search tool, I installed the native Ubuntu packages.

Beagle setup and searching.

Beagle was installed with a simple apt-get install beagle.  As I mentioned in Part 1, apt wants to add 208 packages when it installs Beagle.  A rough review of the packages it adds let me put them into three general categories:  parsers and support libraries, Mono, and X/Gnome.  In my minimal JEOS system, Mono is not available by default.  Beagle is Mono based, so it’s perfectly understandable that it should require Mono in order to install.  It’s a bunch of packages, but it’s in no way frivolous.  Similarly, all of the search tools leverage the work done by other projects in order to index more types of files.  Beagle is able to read a lot of file types, but it does it by making good use of existing libraries.  As a lot of those libraries are under the Gnome umbrella (something I’d love to have in my front yard), a lot of Gnome components go in as dependencies for Beagle’s dependencies.  The last category, X/Gnome, may well be driven by the Beagle GUI:  you can’t get a Gnome GUI running without X and Gnome behind it, so it’s at least sensible.  It would be nice to be able to install a command-line only version of Beagle, though, particularly as it has provisions for non-Gnome interfaces (command line, web, etc.).

Beagle uses extended filesystem attributes as an important part of its indexing.  If extended attributes are unavailable, it can use a pure sqlite back-end, though performance warnings abound.  Extended attributes are typically not enabled by default, but they’re very easy to turn on.  I needed to add user_xattr to the options field of /etc/fstab for each filesystem which contained directories to be indexed.  Then, the filesystems were re-mounted using mount -o remount <filesystem>.
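To make the fstab change concrete, here is a small Python sketch of what “add user_xattr to the options field” means for a single line.  The function name and the sample entry are made up for illustration.  It only transforms a string; it does not touch /etc/fstab or remount anything.

```python
def add_mount_option(fstab_line, option="user_xattr"):
    """Return fstab_line with option added to its options (4th) field.

    Hypothetical helper for illustration; comments, blank lines and
    incomplete entries are returned unchanged.
    """
    stripped = fstab_line.strip()
    if not stripped or stripped.startswith("#"):
        return fstab_line
    fields = stripped.split()
    if len(fields) < 4:
        return fstab_line
    options = fields[3].split(",")
    # Only append when the option is not already present.
    if option not in options:
        options.append(option)
    fields[3] = ",".join(options)
    return "\t".join(fields)


# Made-up example entry, not taken from my test system.
example = "/dev/sda1 /home ext3 defaults 0 2"
print(add_mount_option(example))
```

The options field here goes from defaults to defaults,user_xattr, and running the function a second time on the result changes nothing.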

By default, Beagle is configured to index the running user’s home directory.  It’s configurable via the GUI or the sparsely documented beagle-config.  You can add directories to index via beagle-config FilesQueryable Roots </full/path>.  Directories may be excluded by beagle-config FilesQueryable ExcludeSubdirectory </full/path>.  Any mounted filesystem can be indexed, though for remote filesystems, NFS is likely preferable to SMB:  Beagle states a strong preference for extended attributes, and Samba does not support them in the way Beagle requires.
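Since my goal is to drive everything from scripts, here is a sketch of how those two beagle-config calls could be assembled in Python.  It uses only the two forms quoted above.  The helper name and the directories are hypothetical, and the code only prints the command lines rather than running them.

```python
import shlex

# The two setting names quoted in this post; nothing else about
# beagle-config is assumed here.
ACTIONS = ("Roots", "ExcludeSubdirectory")


def beagle_config_argv(action, path):
    """Build the argument list for one beagle-config FilesQueryable call."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %s" % action)
    if not path.startswith("/"):
        raise ValueError("expected a full path, got: %s" % path)
    return ["beagle-config", "FilesQueryable", action, path]


# Hypothetical directories, for illustration only.
for action, path in [("Roots", "/home/user/search"),
                     ("ExcludeSubdirectory", "/home/user/search/video")]:
    argv = beagle_config_argv(action, path)
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list could be handed to subprocess.run on a box where Beagle is installed.  Keeping the arguments as a list avoids quoting problems with paths that contain spaces.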

Also by default, Beagle excludes $HOME/tmp from the index.  I used this for convenience in my testing, moving all of my test scripts, output and other cruft to $HOME/tmp, which allowed me to maintain a largely clean index of my planned search files.  I say largely because Beagle did something I found very interesting.  If I made a quick modification to a test script using vi, Beagle picked up the updated access time of my ~/.viminfo file and checked to see whether it needed to update the index.

I tried to keep search testing as simple as possible, building searches which had clearly expected results.  I used strings which were unique to file types (docs, spreadsheets, etc.).  Beagle’s Google-like syntax made for simple refinement.

Beagle’s results were largely satisfactory.  File name based searches provided excellent results.  It was able to search documents (pdf, txt, odt, rtf, html) with no problems.   Audio file tags (mp3, m4a, ogg) were also exactly as expected.   Searches of archives were very impressive.  The file name based searches showed that beagle was able to index not just inside archives but inside archives of archives (tar, tgz, zip, bz2).  Its ability to penetrate multiple levels deep into archives was a great surprise.

A datapoint which is currently meaningless is time to search.  It came in at:  10.09s user, 0.92s system,  11.654 total.  That won’t mean anything until we have similar info for the competing tools.
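To collect the same three numbers (user, system and total) for each tool, something like the following Python sketch could be used.  It is an assumption about how one might automate the measurement, not the exact method behind the figures above.  The function name is made up, and the child-process CPU times it reads are only meaningful on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (user, system, total) in seconds.

    user and system are CPU seconds spent in the child process;
    total is wall-clock time.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    total = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user, system, total


# Stand-in command: swap in the search script being measured.
user, system, total = time_command([sys.executable, "-c", "pass"])
print("%.2fs user, %.2fs system, %.3f total" % (user, system, total))
```

Running the identical set of searches through this for every tool should keep the numbers comparable.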

It did have a few problems, though.  I wasn’t able to get any results from any spreadsheets/structured documents.  This was surprising for a few reasons.  The csv with which Beagle was unable to cope is only text, something Beagle handled perfectly elsewhere.  The ods spreadsheet also failed, even though the OpenOffice Writer file was no problem.  No results were found for the contents of the svg file, which is also just text.

Further, Beagle gave me no results from any MS Office files (Office 2003 or Office 2007 format).

The project lists Office 2003 format files as supported, but so is svg, which is no more complicated than text.  I’m sure these are solvable problems, but MS Office documents and csv files are routine fare for most any user.  It shouldn’t take fixing.

Overall, Beagle’s results were good, but not great.  It’s fairly easy to configure, and it’s hard to give it a configuration which makes it fail:  you configure it through commands which reject bad information, rather than through config files which break because of it.  The Google-like search syntax (such as four score ext:pdf to get just the pdf result) is very familiar and comfortable, making narrow, specific queries easy to build.  Beagle’s failures, though, with MS Office files and with some structured text (csv and svg) were a surprise and a disappointment.

In Part 3 I’ll provide my impressions and results with tracker, recoll and strigi.  In Part 4, I’ll lay out general and detailed comparisons.

Update:  Turns out Beagle did have some success with spreadsheets.  I was a little more liberal with my search terms and managed to get some hits.  Still, with identical contents and identical search terms, Beagle only found content in the OpenOffice spreadsheet.  No results were available for identical Excel or CSV files.
