Sunday, April 12, 2015

Posting a couple of basic text tools: suffix array, trie

I extracted a couple of data structures I made for some personal projects and put them on GitHub.

One is a basic suffix array, another is a basic trie. Maybe they will be useful to you.

I'll add more over time or if asked - I have various other things here, like a patricia trie, a fractal trie compressor, a shortest common superstring / supersequence calculator. The SCS algorithm was actually kind of fun to build; getting it to run in a reasonable period of time was tricky. I wrote it to create a superstring for an extended English dictionary of 240K words, and it takes about 20 minutes to do that (unparalleled).

My greedy algorithm superstring of a 2,457,521 character dictionary is 1,858,802 characters long, about 75% of the simple concatenation. Greedy algorithm results are usually within a small constant multiple of the optimal shortest superstring, according to widespread research, so the optimal might be 1/3rd as large, but it won't be 1/9th. Don't really have np-complete time on whole dictionaries to make sure though.

The superstring was so I could compress tries a little more than patricia and fractal will allow alone; instead of holding a string of characters in each node, there's a pointer into a dictionary superstring and a length. There's a tradeoff in cache locality of course. In practice it didn't actually save much memory - the patricia node character runs were generally too short. I think real space savings would appear if I created a superstring for all node contents instead of all original words. There's not a lot of room for improvement with 240K-word tries, so I haven't done that yet. All English words from the Google 2012 corpus is going to be another story.

No comments: