Thursday Night

Paul Betts’s personal website / blog / what-have-you

I’m officially a GNOME developer…kind of

…or at least I have Subversion access for my language work in Beagle. I’ve been working on this for a while now, and the patches got so big that it was much easier to move the work to a branch than to try to maintain them all. You can check out my code at:

svn co http://svn.gnome.org/svn/beagle/branches/beagle-textcat-branch

What does this freaking do, you may ask? Well, the fundamental idea is that you can do a much better job at searching if you can reduce each word to its main piece, called the “stem”. For example, “singing” => “sing”. As you can imagine, this is very language-specific and tricky, so you need a way to figure out what language a piece of text is in before you can attempt to stem it. Fortunately, you can use a trick and do some statistical analysis (here’s the paper on it) to figure out the language most of the time, if you’ve got enough text.
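To make the statistical trick a bit more concrete, here is a rough Python sketch of one well-known approach: build a ranked list of the most frequent character n-grams for each language, do the same for the text in question, and pick the language whose ranking is closest. This is an assumption about the kind of analysis meant above, not the code in the branch; the function names, the two sample sentences, and the tiny profiles are all made up for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Return the text's most frequent character n-grams, most common first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries so prefixes and suffixes count
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum how far each n-gram's rank in the document is from its rank in the language."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)  # cost for an n-gram the language profile lacks
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Pick the language whose profile is the smallest distance from the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical training text; real profiles need far more than one sentence.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun all day",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund den ganzen tag in der sonne",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}
```

Calling `guess_language(some_text, PROFILES)` returns a language code such as `"en"`. With profiles this small the guess is only reliable on text that looks a lot like the samples, which is the “if you’ve got enough text” caveat in action.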

As for the actual algorithm to stem the text, enter Snowball, a string processing language that’s built for writing stemmer programs. I ported the compiler’s code generator to C# (it wasn’t too hard, since it already had a generator for Java), so now the code is generated on the fly for a whole bunch of different languages from the Snowball source.

Now there are only a few things left to do:

  • Load the stemmer classes at runtime using Reflection
  • Decide which one to use based on the language, and the biggest one…
  • Make sure that it doesn’t drive performance through the floor
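The first two items can be sketched roughly as below. This is Python rather than C#, so subclass discovery at runtime stands in for .NET Reflection, and the stemming rules are toy placeholders, not what a Snowball-generated stemmer does. The class names, the `language` attribute, and `stemmer_for` are hypothetical. Instances are cached so that the lookup is paid for only once per language, which is one small nod to the performance item.

```python
class Stemmer:
    """Base class; each subclass names the language it handles."""

    language = None

    def stem(self, word):
        return word  # fallback: leave the word untouched


class EnglishStemmer(Stemmer):
    language = "en"

    def stem(self, word):
        # Toy rule only; a real stemmer applies many more steps.
        if word.endswith("ing") and len(word) > 5:
            return word[:-3]
        return word


class GermanStemmer(Stemmer):
    language = "de"

    def stem(self, word):
        # Toy rule only; a real stemmer applies many more steps.
        if word.endswith("en") and len(word) > 4:
            return word[:-2]
        return word


_instances = {}  # language code -> stemmer instance, created on first use


def stemmer_for(language):
    """Find the stemmer class for a language at runtime and reuse its instance."""
    if language not in _instances:
        cls = next(
            (c for c in Stemmer.__subclasses__() if c.language == language),
            Stemmer,  # unknown language: fall back to the do-nothing stemmer
        )
        _instances[language] = cls()
    return _instances[language]
```

For example, `stemmer_for("en").stem("singing")` gives `"sing"`, while an unrecognized language code falls back to the base class and returns the word unchanged.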

Unfortunately, doing all this language analysis isn’t cheap in terms of either memory or processor utilization. One of the things I have to do is go through the code with a fine-toothed comb, trying to make it run really fast so Beagle doesn’t get bogged down.

Written by Paul Betts

February 1st, 2007 at 12:02 am

Posted in Linux, Mono / .NET