Smartphones, Malware and Payment Systems

by Rich on March 10, 2011

A flurry of Android malware has been in the news lately, including some discussion of a hack which roots the device. That's as significant as a compromise gets, but it's not very interesting. Malware has been rooting devices for a long time, and Android, like anything else, will have exploitable vulnerabilities.

Much more interesting to me is a trojan app which runs up charges on premium SMS numbers. It’s simple as far as attacks go. The app appears to be a media player, but sends expensive texts in the background. It’s also very clever, as it takes advantage of a payment system that’s not usually seen as an attack target. The credit card industry does a huge amount of work to prevent fraud. Merchants are held to strict security requirements. There are clear mechanisms to dispute charges you didn’t make and clear standards by which disputes are resolved.

Does any of that apply to text messages? Good luck. Phone companies are not known for their customer-friendly (or customer-transparent) dispute policies.

By targeting a payment system which was light on controls and light on merchant burdens for charge validation, the attackers smoothed their path to a lot of cash. As more types of purchases and more types of payment systems are tied to mobile devices, credit card levels of controls are going to have to come into play to maintain consumer confidence in mobile payments. The flexibility of mobile payments makes them harder to secure than credit cards.

For example, if I make a purchase on the app store, I need to enter my iTunes password. That’s a sensible control before allowing a payment. Once I enter my password, though, there’s a grace period where I can go back and buy more apps without reentering it. A clever trojan could watch for app purchases and then launch its own requests during that grace period. I might not know about it until a day or two later when I get an email receipt from Apple, assuming I even read it.

The impending onslaught of NFC capabilities in mobile phones is going to make this worse. If NFC-based payments are simple credit card proxies, existing fraud detection mechanisms still come into play. There's no telling how Apple and Google are going to try to build these markets, though. If your phone is just a proxy for your card, Apple or Google can't take their cut of your every purchase. We could easily see iPay launched this summer, with your iTunes account debited by a wave of your phone over a retail kiosk. If mobile providers end up brokering the transactions to credit card companies, the fraud detection landscape changes, and it's unlikely to change for the better.


Smartphones and Steganography

by Rich on February 11, 2011

Security researchers published a scary proof of concept attack on Android smartphones. It’s a pair of Trojan apps which cooperate to steal credit card numbers — either spoken into the phone or entered on the keypad — and then covertly relay them back to the attacker. The attack was very cleverly done and highlights new threats enabled by powerful mobile devices. They received a flurry of publicity, but I think the coverage missed one of the really interesting points of their attack. It’s a practical application of steganography to create a covert communication channel inside a device.

Steganography is the discipline of creating secret messages so that only the sender and the receiver are aware of the existence of the messages. When we think about secret communications, we generally think about encryption, codes, etc. Steganography is different. Encryption and codes (basically) prevent someone who intercepts a message from either reading it or modifying it covertly. Identifiable messages are flying around, but no unwanted senders or receivers can participate in the conversation. Steganography makes the existence of the messages themselves unknown. It's possible to combine steganography with encryption so that both the existence and the contents of your messages are secret, but they do different things. Encryption makes a message unreadable; steganography makes it invisible.

Steganography has some practical and some fanciful uses. It appears most commonly in some printers which print an almost invisible pattern of colored dots on their pages, encoding identifying information about the printer. Presumably, this is to help track down criminals such as counterfeiters. It’s also used in digital watermarking, where identifying information is added to copyrighted materials. If they’re then distributed in violation of copyright, the distributor can be identified by the watermark.

Steganography is often discussed in a freedom fighter context. An individual subject to a repressive government can use steganography to communicate with supporters inside or outside the regime's control without fear of detection, since their communications are invisible. There has been speculation that steganography has been used by some spy agencies and terrorist groups for exactly that purpose.

All of these uses are intended to keep messages secret from other people.

What I find interesting with the smartphone research is that it uses steganography to keep communications secret from the device that is performing the communication. Android limits app-to-app direct communication as a security measure. The researchers used innocuous system settings and system alert mechanisms to create a secret communication channel between applications. The effect of keeping the covert communication secret from the phone is to keep it secret from the phone’s owner, but the goal was to fool the device itself.

That has interesting implications when you think about zone-oriented security (sandboxes, DMZs, virtual environments, "the green zone"). Security zones always share some kind of resources with the systems they inhabit. If they didn't, they'd be completely isolated, and there wouldn't be any need to establish a security zone. By compromising those shared resources, the researchers found an unanticipated use for a benign tool. Because the device which performs the security monitoring is itself the steganographic channel delivering the illicit message, it's extremely hard to detect the wrongdoing as it occurs. It's essentially a heist movie where the bank guard is tricked into carrying a big bag of cash out to the waiting getaway car.


Evaluation of open source desktop search tools continues from Part 1, Part 2 and Part 3 with a late entry and some updates.  During my work on Strigi, its documentation referred to related projects.  Of the several other search tools mentioned, only one wasn't already on my list or a defunct project:  Pinot.  Another C++ based and GPL2 licensed tool, Pinot uses a Xapian back end for its index and relies on dbus for its interprocess communication.  On its face, it's very similar to recoll.  In testing, it showed some interesting differences.

Pinot setup and searching

Pinot was installed with apt-get install pinot, consistent with the other tools.  Apt added 65 packages for pinot, putting it at the lower end for additional packages.  It was really in the same ballpark as recoll once recoll had enough packages to actually be functional.  The great bulk of the packages were parsing libraries and support files rather than X oriented cruft.  It does require dbus, and operationally it is tightly coupled to it.  Otherwise, it's a fairly self contained tool.  As a C++ tool, it didn't have the baggage that beagle did.

Following installation, I found pinot to be very temperamental in my particular (and probably peculiar for pinot) environment.  Pinot’s pinot-dbus-daemon is clearly a dbus oriented tool similar to trackerd.  I needed a similar process to get the daemon running:  dbus-launch bash followed by pinot-dbus-daemon.  While this got the daemon running, no matter what I did or what configuration I tweaked, I wasn’t able to get the daemon to index anything or to pass me any information about why it was less than fully happy.
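
For reference, the startup sequence I used boiled down to the two commands below. Backgrounding the daemon with & is my own habit rather than anything pinot requires.

# start a shell whose children share a dbus session instance
dbus-launch bash
# from that shell, start pinot's indexing daemon
pinot-dbus-daemon &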

I was ultimately able to get some indexing done with pinot, though not with its daemon.  Pinot-index is a once-and-done command line indexer.  It has a few nice options beyond simply indexing.  My index was created with pinot-index --index -b xapian -d ~/.pinot/index.db ~/search.  It appears you can give pinot an arbitrary set of files and index them into a single xapian database, so substituting another path for ~/search is no problem.  This can get messy if you haven't kept track of what's in the index, but pinot gives you a handy option for that:  pinot-index --check -b xapian -d ~/.pinot/index.db ~/search will tell me whether the specified path (and its subdirectories) are included in the index.  That's much cleaner than having to search for something and see if a useful result comes back.
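
Put together, the indexing and index-checking invocations look like this. The ~/search path is just my test directory; substitute whatever you want indexed.

# build a xapian index of everything under ~/search
pinot-index --index -b xapian -d ~/.pinot/index.db ~/search
# check whether ~/search (and its subdirectories) are already in the index
pinot-index --check -b xapian -d ~/.pinot/index.db ~/search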

Pinot’s search was at the quick end of the tools I evaluated.  It clocked in at 0.42s user, 0.29s system, 0.727s total.  It was by no means the fastest, but it was clearly in line with the higher performing tools.

The quality of the results was mid-pack.  For documents, pinot only had hits in text, html, and csv files.  It completely missed on pdf, word processor, and spreadsheet files.  It did manage to read the id3 tags on an mp3 file just fine, but tags on other audio formats were opaque.  It was able to pull hits out of an archive — oddly, it was the tar.bz2.  It had no luck with the zip, gzip, or 7zip, and it also failed to pull information out of the plain tar even though it was successful with a bzip2 compressed tar.  I honestly don't understand that at all.

A really nice diagnostic aspect, at this point, was the ability to check pinot’s index to see if a file was included in it.  Pinot did not include in its index documents it couldn’t read (such as .doc or .docx), which is fine, but the ability to get at least filename information for these files is desirable.  It did include audio files where it was unable to parse tags (such as .flac), allowing successful queries based on filename.  Archives (including 7zip) were also in the index.  This was nice, but failed to explain why the contents of the tar.bz2 were available where no other archive contents were.

Pinot-search has some other interesting features.  It appears to be a more general purpose search tool.  I used the xapian backend, which was a natural fit for my purposes.  It also supports backends for opensearch, sherlock, and Google (using a Google API key).  From a client perspective, this is really interesting, as it means indices could be maintained in the format, or through the indexing system, most appropriate to the content being indexed, with all of them searchable from a single query tool.  There are complexities there around how to know which backend to use, but it seems like a nice feature.

Overall, pinot was fair.  It did manage to access some archived content, handled mp3 successfully and was a fast performer.  It really fell down on word processor documents and its behavior with archives as a whole was very confusing.

In part 5, I’ll wrap up the evaluations and lay out the aggregate results.


My search testing continues in this post with details on using tracker, recoll, and strigi.  My overall intent and plans are laid out in Part 1.  Testing environment details and my work with beagle appear in Part 2.

Tracker setup and searching

As with beagle, tracker was installed using apt-get install tracker.  Apt had a hefty package count for tracker — 201 for tracker vs. 208 for beagle.  These fell into only two general buckets, though:  Tracker and its related libraries/parsers and X/gnome.  Tracker is a C based tool, so there was no need for all of the Mono related additions.  This put tracker’s effective package count a bit higher than beagle’s.  It had a comparable set of tracker and parser related packages as well as a substantial amount of X related infrastructure.  For my purposes, that’s totally unnecessary — I have no plans to use the GUI — but if the GUI is assumed to be part of the tool, it’s understandable.  Tracker does have provisions for non-gui access, though it’s somewhat less friendly than Beagle (more below).

I ran into some early problems.  After completing the initial tracker installation, I couldn’t get trackerd to start, with complaints about hald (Hardware Abstraction Layer Daemon) being unavailable.  It turns out my JEOS environment didn’t have hald installed.  That was fixed with apt-get install hal and service hal start, but it was annoying as I was trying to determine if it was a problem with environment settings or packages.  This also is clearly not a fault of tracker per se, but rather tracker’s packaging in Ubuntu and unidentified dependencies.

Trackerd is explicitly and entirely dependent on dbus.  When run within a gnome session, a dbus session is readily available.  In my command line only environment, there was no dbus session instance waiting for me.  The obvious fix is to use dbus-launch to kick off trackerd.  For my purposes, though, that wasn't ideal.  When launching dbus from inside a shell, the dbus session is a child of the shell, and only the shell's children have access to it.  I needed both trackerd and the search tools to have access to the same dbus session instance, so I used dbus-launch bash to give myself a new shell from which everything would share the dbus session instance I was using for tracker.  Startup, then, was dbus-launch bash, and from that shell, trackerd &.
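
Condensed into commands, that startup sequence was:

# new shell with its own dbus session instance
dbus-launch bash
# from that shell, start the tracker daemon in the background
trackerd &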

That got trackerd running and searchable, but left me without any indexing.  Verbose mode (trackerd -v 2 &) told me that trackerd was running in read-only mode.  This hunt was a little more annoying, as the debugging message about read-only mode is not consistent with the language used in the config file or documentation.  Tracker.cfg has a number of directives which enable content indexing, several of which were disabled.  I had to go in and turn on EnableWatching, EnableIndexing and EnableContentIndexing.  It's possible I could have gotten by with fewer of these, but when the config file comments tell me these 'disable all indexing', I opted for the shotgun approach.  This, again, is not a limitation of tracker, but reflects Ubuntu's packaging decisions.  I don't know that it's even a serious concern, as any desktop configuration of tracker would surely fix these values, and in most cases it would be transparent to the user.  I expect it's more a result of my focus on the command line running contrary to Ubuntu's focus on the desktop.

Directories to crawl and directories to monitor with inotify can be configured in the same tracker.cfg with the CrawlDirectory and WatchDirectoryRoots options, respectively.  As with beagle, by default tracker monitors the user's home directory.  Unlike beagle, which adds a config entry per directory, tracker lists multiple directories on a single line.
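
For reference, the relevant lines of my tracker.cfg ended up looking roughly like the sketch below. The directive names are tracker's own; the values and the path are illustrative of my setup, so check them against the comments in your copy of the file rather than pasting this in blind.

# turn indexing back on (these were disabled in the packaged config)
EnableWatching=true
EnableIndexing=true
EnableContentIndexing=true
# directories to monitor/crawl -- multiple entries go on a single line (path is illustrative)
WatchDirectoryRoots=/home/rich/search;
CrawlDirectory=/home/rich/search;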

Tracker search syntax is different from beagle’s.  Where beagle is google-like, tracker uses RDF/SPARQL syntax.  In principle, this is very powerful, as SPARQL is very flexible and SQL-like.  As a practical matter, it’s probably less accessible to typical users.  There are two different command line search tools:  tracker-search and tracker-query.  Tracker-search takes simple arguments which AND together the strings in the search.  Tracker-query runs the query in a SPARQL file the searcher must provide.  I wanted to keep things simple, so I limited my searches to tracker-search.
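
To illustrate the difference (the .rdf file name here is made up; tracker-query just needs the path to a file containing your query):

# AND together simple terms
tracker-search four score
# run a hand-written RDF/SPARQL query stored in a file
tracker-query myquery.rdf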

Tracker did some things very well.  Document searches did everything beagle did, plus it successfully indexed MS Word 2003 documents.  It was not successful with Word 2007 format documents.  For spreadsheets, it also was only able to provide results for the OpenOffice spreadsheet, returning no results for the xls or csv files.  Audio file tags were effective for mp3 and ogg, but not for m4a.  Filename searches were solid for any files directly visible to tracker, but no results were provided for files contained in an archive of any kind.

The time to search was drastically different from beagle’s.  Tracker came in at 0.110s user, 0.140 system, 0.298 total.  It can’t possibly be a linear, completely scalable relationship, but a difference of roughly 40x is very notable.

On the whole, tracker provided less complete results than beagle, though the addition of MS Word 2003 files was an extremely important benefit.  It is considerably more temperamental when moved outside of its 'standard' desktop environment.  The need to have access to the same dbus session significantly complicates and limits how tracker can be used outside of an X desktop.  With beagle, I could search from any instance of any shell on the system where beagled was running.  With tracker, I had to run trackerd and tracker-search as children of the same shell in order to keep them connected to the same dbus session instance.  Were I using tracker as a back end for a search service, this wouldn't necessarily be an issue, as dbus-launch could load the service, which could in turn simply launch trackerd into the background.  Still, it made trackerd troublesome, harder to debug, and likely less robust.  On more than one occasion, I had to stop a trackerd which then failed to de-register itself from dbus as it shut down.  On restarting tracker, the service would be blocked because the name was already in use.  This was intermittent, but it did require me to kill my shell and restart from dbus-launch bash.

Recoll setup and searching

Recoll setup followed the same form as beagle and tracker:  apt-get install recoll did the trick.  Departing from beagle and tracker, though, was recoll's package count.  It needed a mere 37 packages.  It still had the same primary categories as tracker — recoll tools and search-related libraries, plus X related cruft.  Recoll's GUI is Qt based, which differs from beagle and tracker, and the amount of X related extras is clearly considerably lower.

Recoll stores all of its indexes using Xapian, a C++ based open source search engine library.  It's a highly portable, lightweight engine which offers pretty sophisticated search capabilities and a very generous set of interfaces in most mainstream languages — better than tracker or beagle.

Getting recoll running was very easy once it was installed.  Recollindex will index the user's home directory by default.  Alternate/additional directories can be configured using recoll.conf, as can a variety of indexing options including alternate or additional languages.  This seems to apply to xapian's weighting of important words and what constitutes an important word in a particular language (though I didn't test to sort out exactly what was happening here).
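
As a sketch of what that configuration looks like — I didn't need it, since ~/search sits under my home directory, but limiting indexing to just the test tree would amount to one line in the per-user config (on my install that file lives at ~/.recoll/recoll.conf; topdirs is the directive recoll's documentation gives for the list of trees to index):

# ~/.recoll/recoll.conf
topdirs = ~/search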

Just launching recollindex produces a very different result than kicking off beagled or trackerd.  Recollindex, by default, will index the configured directories, store that info in the database, and exit.  It does not load as a daemon by default.  Using recollindex -m will launch it in daemon mode and will do the same filesystem monitoring (depending on your configuration) as the other tools.
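
So the two modes of operation come down to:

# one shot: index the configured directories, then exit
recollindex
# daemon mode: stay resident and monitor the filesystem for changes
recollindex -m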

The index-then-exit behavior highlights another interesting difference in recoll.  The search client accesses the index database directly; it is a first-class database client rather than a client of the indexing daemon.  So if you have a relatively static system, a system with limited resources, or a system which only needs its index updated periodically, you can leave recollindex off most of the time and save the resources.  Search remains available on demand, as the recoll search client simply reads the xapian data store directly.

Recoll can use the same client binary, recoll, for both its Qt GUI and command line searches.  Documentation indicates that, nominally, the preferred command line client is recollq, but this was not included in the standard Ubuntu packages.  Again, shame on Ubuntu rather than recoll, but it meant recollq was unavailable for my tests.  This wasn't an issue, however.  Run on its own, recoll tries to connect to an X server, but given the -t option, it behaves as a command line tool.
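
A sample command line invocation, then, looks something like the line below. The query reuses my Gettysburg Address test content, and I'm passing the query terms as plain arguments, which is how I drove it here; the exact argument handling may differ between recoll versions.

# -t keeps recoll in terminal mode instead of opening the Qt GUI
recoll -t four score ext:pdf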

Recoll's query structure defaults to Wasabi, a freedesktop.org project intended to standardize desktop search.  It's based on input from the four tools I'm evaluating.  As a practical matter, recoll queries were effective when structured like beagle queries, using a google-like format.

In its standard installation, recoll results were quite anemic.  It failed to index any MS Word, OpenOffice or PDF files and it failed to parse any audio file tags.  Search hits were only returned for text, html and csv file contents.  All filename searches worked well.  Within that limitation, though, recollindex was quite clear about what it was indexing, where it was failing, and which missing helpers were limiting its capabilities.  A quick review of the recollindex errors showed that I needed to add antiword, catdoc, flac, id3v2, libid3-dev, poppler-utils, unrtf, unzip, vorbis-tools, and xsltproc.  Using apt-get to install these and their dependencies added another 27 packages, bringing the total to 64.
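
For anyone following along, the helper installation is a single apt-get line:

apt-get install antiword catdoc flac id3v2 libid3-dev poppler-utils unrtf unzip vorbis-tools xsltproc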

The addition of helpers changed recoll's results dramatically.  It was now able to index all document formats — the only tool so far to handle docx, xls, and xlsx.  This also made recoll the only tool so far to handle all three types of spreadsheets.  With xsltproc available, recoll also became the first tested tool to pull information out of the svg file.  It had no trouble with mp3, ogg, or flac tags, but was unable to use any of the m4a tags.  Disappointingly, even with the addition of unzip, recoll was unable to provide any results from inside any archive format.

Search time was quite speedy, at 0.32s user, 0.31s system, and 0.639 total.  This is in the same league as tracker and considerably faster than beagle.

Strigi setup and search

Strigi's setup differed from the other three tools.  Ubuntu separates the daemon and client utilities into two packages, so apt-get install strigi-daemon strigi-utils was required.  Apt added 34 packages, putting it in the same ballpark as recoll's initial installation (and with similar classes of packages — search tools and parsing libraries were predominant).

Strigi uses CLucene as its storage/search back end by default.  Lucene, an Apache project, is a Java based full-text search engine.  CLucene is a C++ port of Lucene, eliminating the need for Java.  Interestingly, strigi has a pluggable storage back end.  Currently supported are CLucene and hyperestraier,  with sqlite3 and xapian in the works.

Strigi is built by default to use dbus, but does not require it.  Dbus can be disabled at compile time or can be disabled in strigi’s daemon.conf.  In either case, strigi allows you to access the daemon via a socket rather than dbus.

Strigi performed very well with all of the document formats.  Hits were found even in the more troublesome doc and docx formats, as well as all of the structured/spreadsheet files.  The svg file was treated as plain XML, and good results were returned on that as well.  Most interestingly, strigi indexed the svg's contents from inside a number of archives — tgz, tar.bz2, and zip all had their contents properly indexed by strigi.  It wasn't able to handle the 7z archive, but nothing has handled that.  Strigi gave no hits based on audio file tags.  In its defense, it doesn't claim to be able to index these files.  Still, it's a significant omission — reading audio file tags is hardly cutting edge technology.  Strigi was also unable to provide any hits by filename, which was baffling.  Paring the results down by extension/type was similarly disappointing.  That could be a limitation of my ability, but the other tools made it quite easy and intuitive.

Strigi queries returned extremely verbose results.  Where most tools returned a set of file URIs and possibly a hit count, strigi  provided the file matched, mime type, size, time of last modification, and a “fragment” of the file, which in some cases turned out to be the entire contents of the file which contained the search string.  It also provided a lot of information on the ontologies (data and file relationship characteristics used to categorize information in the search engine) of the associated files.  There are some potentially interesting possibilities provided by this behavior (such as, for example, search highlighting based strictly on the returned results rather than a subsequent file retrieval), but it also leads to a lot more information to be handled in order to do anything with the results.  Just reading strigi results was substantially more complicated than with any of the other tools.

Strigi was pretty fast, with the same test searches running in 1.02s user, 0.55s system, 1.571 total.  This was a bit slower than recoll or tracker, but still substantially quicker than beagle.  I used a similar search script for each tool, adjusting for local syntax.  For beagle, tracker, and recoll, my results were in the 85-120 line range.  Strigi, for the same searches, gave me more than 3700 lines.  So, a lot more data was thrown into the results.  That may or may not be valuable, but strigi clearly did a lot in that time.
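
The script itself isn't worth publishing, but for each tool it was roughly the shape of the sketch below, with the client name (tracker-search here) and query syntax adjusted per tool.

# time one search and see how much output it produces
time tracker-search four score > results.txt
wc -l results.txt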

I’m changing plans slightly for Part 4.  In my research about strigi, it referenced a number of other desktop search tools.  Of those, only one was a real project which was still being maintained.  So, I’m going to dig into Pinot a bit.  If it turns out to do anything, I’ll add it to the mix in Part 4.


This is a continuation of my work to sort out the most useful desktop search tool. You can read about the background and motivation in Part 1. In this post, I’ll detail my test plans and work through setting up the tools themselves.

My test platform is minimal and simple: Ubuntu 9.10 JEOS. I’ve updated it as of Feb 22, 2010, and added on a handful of packages: sshd, smb, some dev tools, sqlite, etc. What I don’t have is any kind of a desktop — everything I’m doing, I’ll be doing from scripts and command line. That’s because my goal is to find more of a back-end tool than to worry about mucking around with front ends.

I’ve built a small list of files to act as my ‘set of searchable stuff’. It’s built to give me enough available data that there are some real results and to validate the ability to index important file types.

[jeos1:~/search] ls -R
.:
archives  audio  docs  pics  video

./archives:
doublearchive.tar  pics.7z  pics.tar  pics.tar.bz2  pics.tgz  pics.zip

./audio:
01 Its On The Rocks.m4a
07 - Whiti Zombi - More Human Than Human.ogg
10 Make It Alright.mp3
17. Abba - Lay All Your Love On Me.flac
Spook Country (Unabridged), Part 1.aa

./docs:
avr.csv  avr.xls   gba.doc   gba.html  gba.pdf  gba.txt
avr.ods  avr.xlsx  gba.docx  gba.odt   gba.rtf

./pics:
biathlon.bmp  biathlon.jpg  biathlon.tga   gradient.svg
biathlon.gif  biathlon.png  biathlon.tiff

./video:
aaf-lunch.monkeys.s01e01-sample.avi  JaneAndTheDragon_S01E01_TestsAndJests.mpg
caprica.s01e01.pilot.sample.mp4      preview.wmv
Episode 133_ Capistrano Tasks.mov

File Details:

The archives are made up of the contents of the pics directory.  doublearchive.tar is a tar containing the other archives in the archives folder.

Audio files are all tagged and non-DRM, with the exception of the .aa file.  This is an Audible.com file which uses aac encoding wrapped in Audible’s DRM.

There are lots of docs, but only two sets of text.  For all of the document formats (doc, html, pdf, txt, docx, odt, rtf), I used the text of the Gettysburg Address, threw it into OpenOffice Writer, and exported various files.  The spreadsheets (csv, xls, ods, xlsx) are data sheet summaries of AVRTiny microcontrollers, exported from OpenOffice Calc to various formats.

I was watching biathlon on the Olympics while working on this, so I thought a biathlon image was apt.  The .svg is a sample oval of a color gradient I found on an SVG informational site.

Video files are previews of various shows or games I was able to find with the appropriate file formats, with the exception of the .mov, which is a podcast I happened to have on hand (BTW, I definitely recommend Railscasts if that’s your thing.)

My test system is running on VMWare ESXi, which makes it really easy to keep things clean and consistent.  I took a snapshot prior to installing any search tools, which allows me to roll back to a clean system and put each alternative on a clean box.

Installation is a very simple process.  Using a clean system for each search tool, I installed the native ubuntu packages.

Beagle setup and searching.

Beagle was installed with a simple apt-get install beagle.  As I mentioned in Part 1, apt wants to add 208 packages when it installs Beagle.  A rough review of the packages it adds let me put them into three general categories:  parsers and support libraries, mono, and X/gnome.  In my minimal JEOS system mono is not available by default.  Beagle is Mono based, so it’s perfectly understandable that it should require Mono in order to install.  It’s a bunch of packages, but it’s in no way frivolous.  Similarly, all of the search tools leverage the work done by other projects in order to index more types of files.  Beagle is able to read a lot of file types, but it does it by making good use of existing libraries.  As a lot of those libraries are under the Gnome umbrella (something I’d love to have in my front yard), a lot of Gnome components go in as dependencies for Beagle’s dependencies.  The last set may well be driven by the Beagle GUI — you can’t get a Gnome GUI running without X and Gnome behind it, so it’s at least sensible.  It would be nice to be able to install a command-line only version of Beagle, though, particularly as it has provisions for non-Gnome interfaces (command line, web, etc.)

Beagle uses extended filesystem attributes as an important part of its indexing.  If extended attributes are unavailable, it can use a pure sqlite back-end; however, performance warnings abound.  Extended attributes are typically not enabled by default, but they're very easy to turn on.  I needed to add user_xattr to the options field of /etc/fstab for each filesystem which contained directories to be indexed.  Then, the filesystems were re-mounted using mount -o remount <filesystem>.
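
Concretely, the change amounts to adding user_xattr to the options column for the relevant filesystem and remounting. The device, mount point, and filesystem type below are placeholders for whatever holds your indexed directories.

# /etc/fstab -- add user_xattr to the options field
/dev/sda1  /home  ext4  defaults,user_xattr  0  2

# then remount without rebooting
mount -o remount /home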

By default, beagle is configured to index the running user's home directory.  It's configurable via the GUI or the sparsely documented beagle-config.  You can add directories to index via beagle-config FilesQueryable Roots </full/path>.  Directories may be excluded with beagle-config FilesQueryable ExcludeSubdirectory </full/path>.  Any mounted filesystem can be indexed, though for remote filesystems, NFS is likely preferable to SMB:  beagle states a strong preference for extended attributes, which samba does not support in the way beagle requires.
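
As an example — the paths here are purely illustrative, since ~/search already lives under my home directory and didn't need to be added — adding a root and excluding a subdirectory would look like:

# add a directory root to the index
beagle-config FilesQueryable Roots /data/search
# keep a scratch subdirectory out of the index
beagle-config FilesQueryable ExcludeSubdirectory /data/search/tmp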

Also by default, beagle excludes $HOME/tmp from the index.  I used this for convenience in my testing, moving all of my test scripts, output and other cruft to $HOME/tmp, which allowed me to maintain a largely clean index of my planned search files.  I say largely because beagle did something I found very interesting:  if I made a quick modification to a test script using vi, beagle picked up the updated access time of my ~/.viminfo file and checked to see whether it needed to update the index.

I tried to keep search testing as simple as possible, building searches which had clearly expected results.  I used strings which were unique to file types (docs, spreadsheets, etc.).  Beagle’s Google-like syntax made for simple refinement.
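
A couple of representative queries, using beagle's command line client (beagle-query on my install):

# all hits for the Gettysburg Address text
beagle-query four score
# the same search, narrowed to pdf results
beagle-query four score ext:pdf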

Beagle’s results were largely satisfactory.  File name based searches provided excellent results.  It was able to search documents (pdf, txt, odt, rtf, html) with no problems.   Audio file tags (mp3, m4a, ogg) were also exactly as expected.   Searches of archives were very impressive.  The file name based searches showed that beagle was able to index not just inside archives but inside archives of archives (tar, tgz, zip, bz2).  Its ability to penetrate multiple levels deep into archives was a great surprise.

A datapoint which is currently meaningless is time to search.  It came in at:  10.09s user, 0.92s system,  11.654 total.  That won’t mean anything until we have similar info for the competing tools.

It did have a few problems, though.  I wasn’t able to get any results from any spreadsheets/structured documents.  This was surprising for a few reasons.  The csv with which Beagle was unable to cope is only text, something Beagle handled perfectly elsewhere.  The ods spreadsheet also failed, even though OpenOffice’s writer file was no problem.  No results were found for the contents of the svg file, which is also just text.

Further, Beagle gave me no results from any MS Office files (Office 2003 or Office 2007 format).

The project lists Office 2003 format files as supported, but so is svg, which is no more complicated than text.  I'm sure these are solvable problems, but MS Office documents and csv files are routine fare for almost any user.  They shouldn't need fixing.

Overall, Beagle's results were good, but not great.  It's fairly easy to configure, and it's hard to give it a configuration which makes it fail, since you configure through commands which reject bad information rather than config files which break because of it.  The Google-like search syntax (such as four score ext:pdf to get just the pdf result) is very familiar and comfortable, making narrow, specific queries easy to build.  Beagle's failures with MS Office files and with some structured text (csv and svg), though, were a surprise and a disappointment.

In Part 3 I’ll provide my impressions and results with tracker, recoll and strigi.  In Part 4, I’ll lay out general and detailed comparisons.

Update:  Turns out Beagle did have some success with spreadsheets.  I was a little more liberal with my search terms and managed to get some hits.  Still, with identical contents and identical search terms, Beagle only found content in the OpenOffice spreadsheet.  No results were available for identical Excel or CSV files.


I have a few projects cooking that rely on full-content search.  There’s been a lot of work that’s gone into a number of tools, and the range of options has reached the point where there’s no clear-cut leader.  A number of people have done some work to compare a few of them, but I haven’t found anything both comprehensive and recent.  This is the first of 4 parts detailing my investigation of what seem to me to be the leading tools out there.

Search is a pretty huge universe, and I’m really focused on a small part of it — I need a tool which works in a desktop environment and which is embeddable in a server/system.  I have no interest in trying to out-crawl Google or index the universe.  But I have a lot of files, and, all too often, I’ll discover I have 3-4 copies of something stashed in different places because I couldn’t find what I was looking for.  Finding what I already have is the problem I’m trying to solve.

I want to use open source because I want to be able to tweak and play, and because, depending on how my projects proceed, I may want to make it part of a larger tool set.  So, it has to be some sort of FOSS licensed tool.  I have no dog in the hunt over which license it is.

From what I’ve been able to find, there are 4 leading tools out there which fit both of the above criteria:  Beagle, Tracker, Recoll, and Strigi.  There are a lot of similarities, of course, since they’re all search tools, but they’re different enough to make an evaluation meaningful.  Update:  I’ve added Pinot to the list of tools to test.

My interest is primarily in managing the indexing/searching, interfacing with the tool, and getting good results.  I’m not so interested in the gui provided by any particular tool.  There are a lot of posts out there which show you screenshot upon screenshot.  This is not one of those.  I’m looking at installation/configuration, features, interface capability, and performance (specifically with respect to quality of results, possibly with respect to speed).  All of the work I’ll be doing with these tools will be in a pretty minimal environment:  Ubuntu 9.10 JEOS with current updates applied, but not much else done to it, other than ssh and smb added for my convenience.

A quick overview:

Beagle

Beagle is Mono based, and is one of what appear to be the two more popular tools.  It uses a C# port of Apache Lucene as its search indexer, and is typically used through its GTK front end, making it more common in Gnome based environments.  It uses DBUS for client/server interaction.

Tracker

A Gnome desktop project, Tracker, also called MetaTracker, is a C based tool and is the other search which shares the popularity lead with Beagle.  It also uses DBUS for IPC, and stores its search metadata in a sqlite database.

Recoll

A Qt project, Recoll is the only tool I found using the xapian search engine library.  I don’t know if that’s good or bad, but it’s at least different.  Curiously, Recoll can use the Beagle browser front end, a javascript client which pushes search to a queue which Recoll can service if Beagle isn’t running.

Strigi

Part of KDE's semantic desktop, Strigi uses CLucene, a C++ port of Lucene, as its search index by default, but also supports HyperEstraier and has sqlite and Xapian back ends under development.  CLucene is reported as preferred and fastest, but the pluggable back end is pretty interesting.

In prep for testing I wanted to see what it would take to get the tools installed on my systems.  Pretty recent versions of each tool were available in Ubuntu’s repositories, and a quick check revealed my first interesting metric — the number of packages apt wanted to install:

Beagle: 208

Tracker: 201

Recoll: 37

Strigi: 34

It’s a very stark difference.  I haven’t yet dug into what the huge numbers are for Beagle and Tracker.  I suspect that, since they’re very desktop-user oriented, Ubuntu has all kinds of dependencies which lead to an installation of a full end-user desktop environment.  That’s not something I’m interested in for my JEOS server, but I’m even less interested in building and maintaining my own packages.  I’ll see if I can pare things down a bit before the actual install.

Stay tuned for Part 2, which will lay out my testing plans and move into working with each of the tools.


Homemade Electronics

by Rich on February 15, 2010

I’ve been interested in microcontrollers for a long time. They’re small and cheap, they’re extremely customizable, and if you want to make a piece of electronics do something, they give you a balance between complete efficiency and flexibility. There are a lot of places where it’s likely possible to solve a problem using only passive components, but a microcontroller essentially lets you solve it by explaining what you want (i.e. programming it). Sounds just about perfect to me.

I have a handful of projects I’m kicking around. Some are more curiosities, some have a little more substance to them. The more substantial projects tend to feed other hobbies.

My planned projects, in no particular order, are:

Persistence of Vision (POV) based display

Automated pickup winder

Morse code keyer, and a decoder if I feel fancy

Various CNC projects: small soft materials/PCB router, larger project/wood router, laser engraving

Should keep me busy for a while


A Christmas Zither

by Rich on February 10, 2010

A zither is about as simple a stringed instrument as you can get, at least when it comes to finding something for a young child to play. I built a zither for my three-year-old daughter for Christmas. It's a small, 10-string harp. Really, it's little more than a trapezoidal box with some strings stretched across it, so the strings sit directly over the soundboard and resonating chamber, just as they do with a guitar.

Zither

I built it out of materials which were mostly on-hand. The frame is made of maple 1×2 from Menards — nothing special. The soundboard and back are made from 1/4″ plywood. I used an oak veneered plywood since that's what was available locally for a good price. I'd have preferred maple so it would better match the frame, but it still looks pretty nice. I built bridges to support the strings at each end out of scrap maple, and topped them each off with a length of brass brazing rod.

The brazing rod provided a much harder surface for the strings. This keeps them from cutting into the comparatively soft maple bridges, lets them transfer their pressure down into the soundboard more effectively, and gives them a more stable platform, allowing them to stay in tune better. Not bad for $.15 worth of material I already had lying around.

Strings are plain old ball-end electric guitar strings. I used 16’s, as those were heavy enough that I figured they’d be hard for my daughter to break, but are still plain strings. I drilled up through one side of the frame and recessed holes to retain the strings, but keep the ball ends out of sight. I was able to use a single gauge of strings since they get shorter as they move up the instrument. It kept things very simple.

I did have to use one specialized kind of hardware. The plain end of the strings needed to be anchored and adjustable for tuning. For this, I had to use autoharp pins. They fit by friction and, priced similarly to guitar strings, provide a very economical way to tune the harp.

I tuned the instrument in G. We’re predominantly a guitar household, and G tuning works well for a lot of guitar music. I could have tuned to C as well, but coming up to G gave me a little more tension on the strings which seems to work better. With ten strings, it gives me an octave plus a third, which works well for a lot of songs.

Overall, I think it was a huge success. For under $20, I was able to put together a really nice sounding instrument which plays well, fits well into a lot of songs, and is able to stand up to being used by a young child.
