 Squozen Sort

version 0.2
The question is sometimes raised whether it's possible to sort a set of, say, one million thirty-two-bit numbers using only, say, two million bytes of memory. At first blush, that many longs wouldn't seem to fit in that many bytes, but it can all be done in memory; this page describes one way, implemented in the program sort_n.c.

To clarify some assumptions: we'll read the numbers in ascii decimal from standard input, one line per number. They'll be unsigned longs. There may be duplicates. We'll have some fixed overhead, but the main arrays we sort in will total two million bytes. The results will be printed on standard output.

Compress as you sort.

What makes this possible is that a sorted list of numbers contains less information than the same list before sorting. The number of combinations of a million items, duplicates allowed, chosen from a set of 232 is about 213,511,283.1 (calculated here). It takes about 13.5M / 8 ~= 1.7M bytes to represent one such choice. If we can compress our intermediate sorting results nearly that well, we'll have enough memory to sort in. Here we use a one-level merge sort to do the sort as a whole.

Rice-coded differences

Sorting some of the input is the first compression step. The next is to encode the differences between successive numbers rather than the numbers themselves. I call the differences "gaps". Finally, we need an encoding that takes advantage of the fact that most of the gaps are small. The code used here is a Golomb/Rice code:
[1...] 0 bbbb bbbb bbbb
That is, zero or more one bits, a zero, and twelve binary bits. The leading ones represent gap / 4096 in unary, and the binary bits encode gap % 4096. For a million gaps that add up to 232, the code uses at most
232/4096 + 1,000,000*(1+12) = 14,048,576 bits = 1,756,072 bytes,
which is within 4% of the theoretical minimum mentioned above. It's easy to verify that the Rice code works well here, but the choice wasn't obvious or intuitive to me beforehand, so this page shows a systematic path to it.

The merge sort

Given this method of compressing sorted lists, a merge sort seems natural. The usual merge sort is a recursive binary merge, but the trouble with a recursive sort in this context is that our code is (and has to be) less efficient for smaller lists of numbers. I use a single-level merge here; here's why that's not terribly bad.

If we allocate a buffer for the final sorted list, the space we have left is...

2,000,000 - 1,756,072 = 243,928 bytes
...enough for a buffer of 60,982 32-bit numbers.

We will fill and sort the small buffer 1,000,000 / 60,982 < 17 times (but see below), and each time merge it into the big buffer. That means that in addition to the the work of sorting the small bufferfulls, we do about 1,000,000 * 17 / 2 comparisons, 8.5 * n, which is about half the number of comparisons within the sorts. But we also do about that number of compressions and decompressions, which are more expensive.

The growth of the buffer over merge passes looks like a triangle, which suggests O( n^2 ) space use, but the base of the triangle is set by the ratio of the two buffer sizes, not by the size of the input. As long as we're working in memory, and the two buffers' capacities are k:1, then the total work is

O( n * (k/2 + log( n/k )) ),
which is still O( n log n ). To merge the sorted numbers into a single buffer that can only fit the final result, first slide the compressed data to the end of the big buffer. Start uncompressing the old data, merging with the small buffer of sorted but uncompressed data, compressing the result, and filling in the big buffer from the beginning. Old compressed data is read and processed leaving space for new data that's been merged.

The slack space between the new data and the remaining old data shrinks because more data from the small buffer is being inserted. But because the big buffer is big enough for the final result, we know that the total of the new and remaining old data will fit without overlap-- the same calculation of the total size is valid at every intermediate point.

One more optimization: the last merge step doesn't need its results recompressed into the big buffer, they only need to be sent to the output. That means the big buffer's capacity can be reduced by the capacity of the small buffer. That in turn means the small buffer can be larger, which means the big buffer can be smaller, and so forth. I didn't look for a closed form or efficient way to find the squeezed buffer size, I just iterate until I get to a fixed point. For the original problem of one million longs in two million bytes, this reduces ~17 passes of ~61K items each to ~10 passes of ~103K.

Tests

The sort will work with less memory (down to a limit) at the expense of more passes, or with more memory and fewer passes. The bash script time_test runs the same sort problem with buffer sizes from 1,800,000 to 4,100,000 bytes. With the smallest buffer, 55 passes and 15.2 seconds are needed on my computer. With a 2MB buffer, 10 passes and 5.6 seconds, and with the largest buffer, the compression and merging is skipped entirely and the sort takes 4.6 seconds. In each of these, about 3 seconds are spent reading and writing the file.

That compares to 18.5 seconds for Unix's sort -n, but that's a little unfair since Unix sort works on the data as strings, sorts by arbitrary combinations of fields, and can sort a bigger variety of number formats.

Code and files

 Makefile (1K) for C programs, index.html, combos.html and test files combino.py_txt (3K) calculates table for combos.html combos.html (10K) derivation and table for combinatorial math in squozen_code.html combos.php.txt (4K) php source for combos.html factorial.py_txt (1K) Python procedures for n! and log-base-2( n! ) gamma.py_txt (10K) Python gamma and log-gamma functions used by factorial.py log2_factorial (1K) executable Python log-base-2( n! ) for large n pr_urands.c (2K) generate random test files sort_n.c (17K) the squozen sort program squozen_code.html (21K) arriving at the Rice code step by step time_test (1K) shows speed vs. buffer size. Also compares to Unix sort. time_test.out (14K) test results when run on my machine urands.txt.gz (5000K) random test file (gnuzip'ped) urands_w.txt.gz (4603K) random multiples of 4096, sort_n's worst case (gnuzip'ped)