Software Design Using C++
External sorting refers to the sorting of a file that is on disk (or tape). Internal sorting refers to the sorting of an array of data that is in RAM. The main concern with external sorting is to minimize disk access since reading a disk block takes about a million times longer than accessing an item in RAM (according to Shaffer -- see the reference at the end of this document).
Perhaps the simplest form of external sorting is to use a fast internal sort with good locality of reference (meaning that it tends to reference nearby items rather than widely scattered ones) and hope that your operating system's virtual memory can handle the file. (Quicksort is one sort algorithm that is generally very fast and has good locality of reference.) If the file is huge enough, however, it may not fit even in virtual memory. Even when it does fit, performance may be poor because of the large amount of time it takes to access data on disk.
Most external sort routines are based on mergesort. They typically break a large data file into a number of shorter, sorted runs. These can be produced by repeatedly reading a section of the data file into RAM, sorting it with ordinary quicksort, and writing the sorted data to disk as a run. After the sorted runs have been generated, a merge algorithm is used to combine the sorted files into longer sorted files. The simplest scheme uses a 2-way merge: merge 2 sorted files into one sorted file, then merge 2 more, and so on until there is just one large sorted file. A better scheme is a multiway merge, which might combine, say, 128 shorter runs at a time.
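The run-building step can be sketched as follows, assuming a line-based input file (one word per line, as in the test data described later). The function name, run file names, and the line-count parameter are illustrative only; they are not taken from the example program.

```cpp
// Sketch of run creation: read a chunk of the input, sort it in RAM,
// and write each sorted chunk to its own run file.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads up to maxLines lines at a time, sorts each batch in RAM, and
// writes each sorted batch to its own run file. Returns the run count.
int MakeRuns(const std::string & inName, std::size_t maxLines)
{
    std::ifstream in(inName);
    std::vector<std::string> buffer;
    std::string line;
    int runCount = 0;

    auto writeRun = [&]()
    {
        std::sort(buffer.begin(), buffer.end());
        std::ofstream out("run" + std::to_string(runCount++) + ".txt");
        for (const auto & s : buffer)
            out << s << '\n';
        buffer.clear();
    };

    while (std::getline(in, line))
    {
        buffer.push_back(line);
        if (buffer.size() == maxLines)
            writeRun();
    }
    if (!buffer.empty())   // leftover partial run at end of file
        writeRun();
    return runCount;
}
```

A real implementation would size each run by the amount of available RAM rather than by a fixed line count, but the pattern of read, sort, write is the same.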
According to Shaffer, a multiway merge using half a megabyte of RAM and a disk block size of 4 KB could hold 128 disk blocks in RAM at once. This would allow 128 runs to be merged together in one pass. The average initial run size would be 1 MB. (See Shaffer on how that can be obtained with only 1/2 MB of RAM.) A file of size 128 MB could be sorted in 2 passes (one to build the initial runs and one to merge them). A file of size 16 gigabytes could be sorted in just 3 passes.
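The arithmetic behind those figures can be checked directly. This fragment simply restates Shaffer's numbers; the constant names are mine, chosen for the illustration.

```cpp
#include <cstdint>

const std::int64_t KB = 1024, MB = 1024 * KB, GB = 1024 * MB;

// RAM available for merge buffers and the disk block size, as in
// Shaffer's scenario.
const std::int64_t mergeRAM  = 512 * KB;   // half a megabyte
const std::int64_t blockSize = 4 * KB;

// Number of disk blocks that fit in RAM at once, i.e. how many runs
// can be merged in a single pass:
const std::int64_t mergeWays = mergeRAM / blockSize;       // 128

const std::int64_t runSize = 1 * MB;       // average initial run size

// Largest file sortable in 2 passes: one pass builds up to 128 runs
// of 1 MB each, and a second pass merges them all.
const std::int64_t twoPassMax = mergeWays * runSize;       // 128 MB

// A third pass can merge up to 128 of those 128 MB sorted files.
const std::int64_t threePassMax = mergeWays * twoPassMax;  // 16 GB
```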
Note that you do not want to jump back and forth between 2 or more files in trying to merge them (while writing to a third file), as this would likely produce a lot of time-consuming disk seeks. Instead, on a single-user PC, it is better to read a block of each of the 2 (or more) files into RAM and carry out the merge algorithm there, with the output also kept in a buffer in RAM until the buffer is filled (or we are out of data) and only then written out to disk. When the merge algorithm exhausts one of the blocks of data, refill it by reading the next block of the associated file from disk. This is called buffering. On a larger machine where the disk drive is shared among many users, it may not make sense to worry about this, as the read/write head is going to be seeking all over the place anyway.
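Such a buffered 2-way merge can be sketched as below, with fixed counts of lines standing in for 4 KB disk blocks so the refilling and output buffering stay visible. All names here are illustrative; this is not the example program's actual code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Presents one sorted run file a block at a time, refilling its
// in-RAM block from disk whenever the block is exhausted.
class RunReader
{
public:
    RunReader(const std::string & name, std::size_t blockLines)
        : in(name), capacity(blockLines), pos(0) { Refill(); }

    bool Done() const { return pos == block.size(); }
    const std::string & Front() const { return block[pos]; }
    void Pop() { if (++pos == block.size()) Refill(); }

private:
    void Refill()
    {
        block.clear();
        pos = 0;
        std::string line;
        while (block.size() < capacity && std::getline(in, line))
            block.push_back(line);
    }
    std::ifstream in;
    std::size_t capacity, pos;
    std::vector<std::string> block;
};

// Merges two sorted run files into outName, collecting output lines in
// a buffer and writing them out in batches rather than one at a time.
void MergeRuns(const std::string & aName, const std::string & bName,
               const std::string & outName, std::size_t blockLines)
{
    RunReader a(aName, blockLines), b(bName, blockLines);
    std::ofstream out(outName);
    std::vector<std::string> outBuf;

    auto flush = [&]()
    {
        for (const auto & s : outBuf)
            out << s << '\n';
        outBuf.clear();
    };

    while (!a.Done() && !b.Done())
    {
        if (a.Front() <= b.Front()) { outBuf.push_back(a.Front()); a.Pop(); }
        else                        { outBuf.push_back(b.Front()); b.Pop(); }
        if (outBuf.size() == blockLines)
            flush();
    }
    // One run is exhausted: copy whatever remains of the other.
    for (RunReader * r : { &a, &b })
        while (!r->Done()) { outBuf.push_back(r->Front()); r->Pop(); }
    flush();
}
```

A production version would read and write raw disk blocks rather than lines, but the shape is the same: compare the fronts of the two in-RAM blocks, refill each block as it empties, and flush the output buffer only when it fills.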
Shaffer presents the following practical data concerning external sorting. In this experiment a 4 MB file was sorted on a particular computer. A simple mergesort that did not build initial sorted runs took 451 seconds. A 2-way mergesort that used initial runs of 128 KB took only 160 seconds. A multiway mergesort that used the same initial runs took only 103 seconds. Clearly, using initial sorted runs dramatically speeds up the sorting.
The ideas behind an external sort seem simple enough, but implementing a working program is fairly complex. The following example attempts to show the main features used in almost any external sort: producing sorted initial runs, merging sorted runs, and buffering data. However, the design is simpler than that of a typical real-world external sort. In particular, the example program merges only 2 sorted files at a time; it does not attempt a multiway merge (such as the 128-way merge mentioned above). The program also uses buffers of size 64 KB, which is no doubt smaller than ideal; performance would probably be better with larger buffers. The buffers were kept small so that the merge portion of the algorithm could be observed without needing a huge test file. Note that when the program is creating a sorted run, it uses a single 64 KB buffer, but when it is merging two sorted runs, it uses three 64 KB buffers. A more likely scenario in a good external sort is that the same total amount of memory is used in both cases, no matter how many buffers each case requires. Buffers all of the same size were used here to keep the example simpler.
Note that the header file has a symbol DEBUG that can be defined if you want to see debugging output. Comment out this line if you do not want this information to appear on the screen. Shown below are the debugging messages produced when sorting a modified copy of the linux.words file. (This file had a couple of extra words added. Also, since the words were already in order, the file was sorted with the Linux sort command, starting at the third character of each line. This scrambled the order of the words, giving appropriate test data for the external sort program.) Note that the test data file contained 45429 words, one per line.
The above output shows that the promised pattern of merges was indeed used by the external sort program. Note that when the number of sorted runs is odd, the remaining file is simply renamed. The final merge shows that when only 2 sorted runs remain, they are merged back into the original file.
Those who are interested might want to try modifying the example program to use a multiway merge and possibly the "replacement selection" algorithm discussed in Shaffer's text. See the references below for more complete information and more advanced methods. Then try writing your own external sort!