The evaluation of open source desktop search tools continues from Part 1, Part 2 and Part 3 with a late entry and some updates. During my work on Strigi, its documentation referred to related projects. Of the several other search tools mentioned, only one was neither already on my list nor a defunct project: Pinot. Another C++ based and GPL2 licensed tool, Pinot uses a xapian back end for its index and relies on dbus for its interprocess communication. On its face, it’s very similar to recoll. In testing, it showed some interesting differences.
Pinot setup and searching
Pinot was installed with apt-get install pinot, keeping it very consistent with the other tools. Apt added 65 packages for pinot, keeping it at the lower end of additional packages. It was really in the same ballpark as recoll once recoll had enough packages to actually be functional. The great bulk of the packages were parsing libraries and support files rather than X oriented cruft. It does require dbus, and operationally is tightly coupled to it. Otherwise, it’s a fairly self contained tool. As a C++ tool, it didn’t have the baggage that beagle did.
Following installation, I found pinot to be very temperamental in my particular (and probably peculiar for pinot) environment. Pinot’s pinot-dbus-daemon is clearly a dbus oriented tool similar to trackerd. I needed a similar process to get the daemon running: dbus-launch bash followed by pinot-dbus-daemon. While this got the daemon running, no matter what I did or what configuration I tweaked, I wasn’t able to get the daemon to index anything or to pass me any information about why it was less than fully happy.
I was ultimately able to get some indexing done with pinot, though not with its daemon. Pinot-index is a once-and-done command line indexer. It has a few nice options beyond simply indexing. My index was created with pinot-index --index -b xapian -d ~/.pinot/index.db ~/search. It appears you can give pinot an arbitrary set of files and index them to a single xapian database. So, substituting another path for ~/search is no problem. This can get messy if you haven’t kept track of what’s in the index, but pinot gives you a handy option for that. pinot-index --check -b xapian -d ~/.pinot/index.db ~/search will tell me whether the specified path and its subdirectories are included in the index. That’s much cleaner than having to search for something and see if a useful result is returned.
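If you index more than one directory into the same database, the two invocations above are easy to wrap in a few lines of shell. This is only a rough sketch: the pinot_run function and the PINOT_DB and DRY_RUN variables are names I made up for illustration, and the only pinot-index options used are the ones already shown (the index and check actions, -b and -d, written with the usual double-hyphen spelling). It defaults to a dry run that prints the commands, since I haven’t verified what the check action prints or returns beyond what I described.

```shell
#!/bin/sh
# Sketch only: PINOT_DB, DRY_RUN and pinot_run are illustrative names,
# not part of pinot itself.
PINOT_DB="${PINOT_DB:-$HOME/.pinot/index.db}"
DRY_RUN="${DRY_RUN:-1}"   # set DRY_RUN=0 to really call pinot-index

# Run (or, in dry-run mode, just print) pinot-index with the options
# used in this post. $1 is "index" or "check", $2 is the directory.
pinot_run() {
    case "$1" in
        index|check) ;;
        *) echo "unknown action: $1" >&2; return 2 ;;
    esac
    if [ "$DRY_RUN" = "1" ]; then
        echo "pinot-index --$1 -b xapian -d $PINOT_DB $2"
    else
        pinot-index "--$1" -b xapian -d "$PINOT_DB" "$2"
    fi
}

# Index every directory given on the command line, then ask pinot
# whether each one made it into the index.
for dir in "$@"; do
    pinot_run index "$dir"
    pinot_run check "$dir"
done
```

Saved under any name you like and run with something like ~/search ~/notes as arguments, it prints one index line and one check line per directory. Setting DRY_RUN=0 hands the same arguments to the real pinot-index.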
Pinot’s search was at the quick end of the tools I evaluated. It clocked in at 0.42s user, 0.29s system, 0.727s total. It was by no means the fastest, but it was clearly in line with the higher performing tools.
The quality of the results was mid-pack. For documents, pinot only had hits in text, html, and csv files. It completely missed on pdf, word processor and spreadsheet files. It did manage to read the id3 tags on an mp3 file just fine, but tags on other audio formats were opaque. It was able to pull hits out of an archive, though oddly it was the tar.bz2. It had no luck with the zip, gzip or 7zip. It also failed to pull information out of the plain tar even though it was successful with a bzip compressed tar. I honestly don’t understand that at all.
A really nice diagnostic aspect, at this point, was the ability to check pinot’s index to see if a file was included in it. Pinot did not include in its index documents it couldn’t read (such as .doc or .docx), which is fine, but the ability to get at least filename information for these files is desirable. It did include audio files where it was unable to parse tags (such as .flac), allowing successful queries based on filename. Archives (including 7zip) were also in the index. This was nice, but failed to explain why the contents of the tar.bz2 were available where no other archive contents were.
Pinot-search has some other interesting features. It appears to be a more general purpose search tool. I used the xapian backend, which was a natural fit for my purposes. It also supports backends for opensearch, sherlock, and Google (using a Google API key). From a client perspective, this is really interesting, as it means indices could be maintained in the format or through the indexing system most appropriate to the content being indexed and searches could be accessed from a single query tool. There are complexities there around how to know which backend to use, but it seems like a nice feature.
Overall, pinot was fair. It did manage to access some archived content, handled mp3 successfully and was a fast performer. It really fell down on word processor documents and its behavior with archives as a whole was very confusing.
In part 5, I’ll wrap up the evaluations and lay out the aggregate results.