Friday, June 5, 2015

Burst Sorter in C#


I wrote a preliminary Burst Trie implementation for Burst Sorting strings. It's available via MIT license on GitHub. At this point it's running about 30% faster than Array.Sort() for sorting arrays of strings representing all English words in Shakespeare's plays; its relative performance will vary depending on the number of strings you're sorting, but generally the larger the set the better off you are using a burst sort than a more traditional sort.

The built-in System.Array.Sort() method in .NET uses quicksort, heapsort, or insertionsort depending on the data; burst sort can beat these due to its cache-locality awareness.

I used these materials to learn about burst tries and burst sorting; I am indebted to their authors:

Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Burst Tries, A Fast, Efficient Data Structure For String Keys
Engineering BurstSort: Towards Fast, In-Place String Sorting

Please note that this code may not be exactly production ready for you: your mileage may vary and I'm continuing to work on it. I did not consult any existing source code; this is written from scratch. If you want to use it, make sure it behaves the way you expect. Feel free to reach out to me at pmcculler@gmail.com to ask questions or provide feedback, and of course, you can do the fork/pull thing on GitHub.

Up next for this project are some more correctness tests (yes, there are tests now!) and exploring various performance improvement possibilities, including sub-bucketing as described by Sinha and Wirth in their Engineering BurstSort paper, use of unsafe code, and implementing a Patricia burst trie (as far as I know, nobody has done that), and a number of minor todos scattered throughout the code.

There may be a pretty obvious question which is something like "if you're that concerned about performance, why are you using C# and not C/C++?"

  1. Most performance improvement comes from algorithmic improvement, not point-optimization. Example: running a loop N times instead of N^2 times is better and better the bigger N gets, even if you make each iteration substantially more efficient in the N^2 version.
  2. After the .NET JIT compiler gets done with the IL, similar code is approximately as fast in C# as in C++. You're free to argue with me on this but it's not going to be fruitful; we're talking a few percentage points, not 3x differences. I'm reasonably cautious about avoiding obvious problems here.
  3. People use C#. A lot. C++ is great but it can be a bit of a bear for some problems, so it's handy to have this.
  4. There doesn't appear to be any C# implementation available anywhere else, so there is novelty value.
  5. I get to do whatever I want with my time. ;)

Patrick