Monday, October 29, 2018

Linear time suffix tree construction

I needed (well... really really wanted) a linear-time online suffix tree for a project (linear time construction, specifically), and since that sort of thing isn't off-the-shelf it took some time and wrangling to put one together. I eventually started from scratch three times before I got the solution I wanted (i.e. one which is actualfax linear time construction and which always builds a correct suffix tree).

I implement advanced and specialized data structures and algorithms for fun, but I found Ukkonen's algorithm for online linear time suffix tree construction challenging to implement correctly, and it took some hours. It's not too hard to get an implementation to *usually* work, but if you want it to *always* work, there are a number of cases you have to consider carefully. I tend to try to work from original papers - read what the designers had to say and go from there. This is usually very effective, and I get a strong grasp of the problem and the author's solution, but with this paper a number of steps and cases were omitted. I was disappointed to discover this and it took time to determine what was missing and fill in the gaps in a manner consistent with the algorithm designer's intent. Here's Ukkonen's paper: https://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

Eventually I looked around the net to see how other people have dealt with this, and sure enough it does seem other people have encountered this before. Some explanations are out there on the net, and some implementation are available, some of which may work. Others definitely don't. I'm pretty confident in mine - certainly, it did everything I needed it to do, and passes a variety of tests.

This is a fast algorithm, though it dances around memory a lot - there's no cache awareness, no effort towards memory locality, that kind of thing. And the expansion over space is pretty dramatic: the in-memory suffix tree is likely to be many times larger than the original document, depending on its entropy. I'm noodling on ways to improve this, arranging the tree to reduce cache misses so it's even quicker to build and perhaps a bit more compact in memory overall.

The linear time suffix tree is here on GitHub, along with some other things I've had fun building: a union find, a burst true, burst sorter, tries, a cuckoo hash, etc. Enjoy.

No comments: