Friday, June 5, 2015

Burst Sorter in C#


I wrote a preliminary Burst Trie implementation for Burst Sorting strings. It's available via MIT license on GitHub. At this point it's running about 30% faster than Array.Sort() for sorting arrays of strings representing all English words in Shakespeare's plays; its relative performance will vary depending on the number of strings you're sorting, but generally the larger the set the better off you are using a burst sort than a more traditional sort.

The built-in System.Array.Sort() method in .NET uses quicksort, heapsort, or insertionsort depending on the data; burst sort can beat these due to its cache-locality awareness.

I used these materials to learn about burst tries and burst sorting; I am indebted to their authors:

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Burst Tries, A Fast, Efficient Data Structure For String Keys
Engineering BurstSort: Towards Fast, In-Place String Sorting

Please note that this code may not be exactly production ready for you: your mileage may vary and I'm continuing to work on it. I did not consult any existing source code; this is written from scratch. If you want to use it, make sure it behaves the way you expect. Feel free to reach out to me at pmcculler@gmail.com to ask questions or provide feedback, and of course, you can do the fork/pull thing on GitHub.

Up next for this project are some more correctness tests (yes, there are tests now!) and exploring various performance improvement possibilities, including sub-bucketing as described by Sinha and Wirth in their Engineering BurstSort paper, use of unsafe code, and implementing a Patricia burst trie (as far as I know, nobody has done that), and a number of minor todos scattered throughout the code.

There may be a pretty obvious question which is something like "if you're that concerned about performance, why are you using C# and not C/C++?"

  1. Most performance improvement comes from algorithmic improvement, not point-optimization. Example: running a loop N times instead of N^2 times is better and better the bigger N gets, even if you make each iteration substantially more efficient in the N^2 version.
  2. After the .NET JIT compiler gets done with the IL, similar code is approximately as fast in C# as in C++. You're free to argue with me on this but it's not going to be fruitful; we're talking a few percentage points, not 3x differences. I'm reasonably cautious about avoiding obvious problems here.
  3. People use C#. A lot. C++ is great but it can be a bit of a bear for some problems, so it's handy to have this.
  4. There doesn't appear to be any C# implementation available anywhere else, so there is novelty value.
  5. I get to do whatever I want with my time. ;)

Patrick


Sunday, April 12, 2015

Posting a couple of basic text tools: suffix array, trie

I extracted a couple of data structures I made for some personal projects and put them on GitHub.

One is a basic suffix array, another is a basic trie. Maybe they will be useful to you.

I'll add more over time or if asked - I have various other things here, like a patricia trie, a fractal trie compressor, a shortest common superstring / supersequence calculator. The SCS algorithm was actually kind of fun to build; getting it to run in a reasonable period of time was tricky. I wrote it to create a superstring for an extended English dictionary of 240K words, and it takes about 20 minutes to do that (unparalleled).

My greedy algorithm superstring of a 2,457,521 character dictionary is 1,858,802 characters long, about 75% of the simple concatenation. Greedy algorithm results are usually within a small constant multiple of the optimal shortest superstring, according to widespread research, so the optimal might be 1/3rd as large, but it won't be 1/9th. Don't really have np-complete time on whole dictionaries to make sure though.

The superstring was so I could compress tries a little more than patricia and fractal will allow alone; instead of holding a string of characters in each node, there's a pointer into a dictionary superstring and a length. There's a tradeoff in cache locality of course. In practice it didn't actually save much memory - the patricia node character runs were generally too short. I think real space savings would appear if I created a superstring for all node contents instead of all original words. There's not a lot of room for improvement with 240K-word tries, so I haven't done that yet. All English words from the Google 2012 corpus is going to be another story.

Tuesday, March 17, 2015

"...I have not done interviews for quite a while was hoping to get some guidance/advice from you on the best approach."



I am well and hope you are too!

Here's my general approach when I'm looking.
  • First thing I do is pull out a copy of Programming Interviews Exposed and work through the examples. Makes my mind sharp for interviews and helps substantially when I'm asked coding questions. 
  • Update LinkedIn & resume with current position, including all the latest appropriate buzzwords. 
  • Update past work in resumes & LinkedIn so it uses cutting edge words that weren't in use at the time, like 'big data' in 2005 or something like that. 
  • Update LinkedIn every day in some way so it comes up in recruiter searches (LinkedIn is actually a first stop for most recruiters at this point). 
  • Look at what contacts are doing and ask about opportunities. Probably have to ask a LOT of people, and honestly the careers page of their company's website is probably a better source of information. May be able to get a referral in though, getting through the first step of the gauntlet automatically. 
  • Look at the start-up scene in the Seattle area, several ways to do this. 
  • Check out the major companies with local presence such as Facebook, Twitter, etc and see what they are up to, submit to anything looking good. 
  • Contact any recruiter who's written to me or I've worked with before and let them know I am available starting soon or now (they love immediacy). 
  • Look at the boards i.e. http://careers.stackoverflow.com/
  • Sign up with TheLadders. See if they still have a free resume assessment service, that was useful (cursory, not the whole assessment, but a gloss that I found useful).
  • Consider getting a premium LinkedIn account. Seeing everyone who visited me and the like was nice when I was looking.
  • Stay relaxed and focused during interviews. People are there to judge you, yes, but you never have to see them again. Who cares what they think. You're getting interview practice, free coffee, and the possibility of a new job. All good things, no downsides that you don't bring to the table yourself.
  • Consider practicing an interview with someone, even a family member.
I always consider the first interview I have when i'm looking to be a loss - it's rarely a good outcome. Just getting into the groove of it. So maybe interview at a place you're not terribly interested in, for a job you're overqualified for, if you can get such an interview.

People look for 'passion' as much as for skills. I don't know why or what this means, except showing some excitement for new techniques and talking about ways you innovated or at least really cared about working on some product. Highlight all the best things, there's no reason not to. Think about this ahead of time.

That's all that comes to mind. I wish you luck! There's a killer market for great devs out there, so as long as people recognize that you have the skills and a passion for making products awesome, I suspect it will work out well.