I/O-Efficient Sorting
Sorting is needed in many cases:
In a database context, we sometimes want to sort in order to create an index on some data.
We sometimes want to sort the output when the query has an ORDER BY.
Some join algorithms (such as sort-merge join) require their inputs to be sorted.
Constructing a B+ tree index involves sorting.
One way to eliminate duplicates is to sort the data so that duplicates end up next to each other.
When grouping data with GROUP BY, one approach is to sort the data on the attribute being grouped on.
Sorting can be used to remove duplicates: after sorting, duplicates are adjacent, so a single scan can compare each record with the one before it and drop the repeats. This is a reason not to put DISTINCT into your queries when you don't need it, because there is an expense involved in sorting the data.
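A minimal sketch of that post-sort scan, in Python (dedup_sorted is a made-up helper, not from the notes):

```python
def dedup_sorted(records):
    """Drop duplicates from an already-sorted sequence with one linear scan."""
    previous = object()              # sentinel that compares unequal to everything
    for record in records:
        if record != previous:       # emit a record only when the value changes
            yield record
        previous = record

print(list(dedup_sorted([1, 1, 2, 3, 3, 3, 7])))   # [1, 2, 3, 7]
```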
Issues
When we look at performance, we're going to be looking at the I/O cost: the number of disk accesses and transfers.
The data might not fit in main memory. If it fits in main memory, you can use whatever your favorite main memory sorting algorithm is.
Translating in-memory algorithms to the setting where the data sits on disk doesn't always produce something efficient.
The one that works well, or at least has a good analog, whether the data is in memory or not, is merge sort.

Instead of dividing the data into two pieces, sorting them, and merging them, we're going to divide it into many pieces and then merge them back together: a multi-way merge instead of a two-way merge.
Merge Sort Example
Use d-way merging to merge d sorted files into one larger sorted file, repeating until only one file is left.
Suppose we have a 25.6 GB data file (256 million records of 100 bytes each) and 100 MB of main memory available for sorting.
Sort chunks
Chunk the data up by reading as much as we can fit into memory at a time: 100 MB.
Sort that chunk once it's in memory, using any in-memory sorting algorithm.
Write it back out to disk as a new file.
Repeat those three steps until all the data has been read; with the numbers above that gives 25.6 GB / 100 MB = 256 chunks. The original unsorted data file is now a collection of sorted chunks sitting on disk (sketched below).
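A minimal sketch of this run-formation phase, assuming fixed-size 100-byte records; create_sorted_runs and the temporary-file handling are illustrative, not from the original notes:

```python
import os
import tempfile

RECORD_SIZE = 100                              # bytes per record (example figures)
MEMORY_BUDGET = 100 * 1024 * 1024              # 100 MB available for sorting
RECORDS_PER_CHUNK = MEMORY_BUDGET // RECORD_SIZE

def create_sorted_runs(input_path):
    """Read memory-sized chunks, sort each in memory, write each back out as its own file."""
    run_paths = []
    with open(input_path, "rb") as f:
        while True:
            chunk = f.read(RECORDS_PER_CHUNK * RECORD_SIZE)    # as much as fits in memory
            if not chunk:
                break                                          # all data has been read
            records = [chunk[i:i + RECORD_SIZE]
                       for i in range(0, len(chunk), RECORD_SIZE)]
            records.sort()                                     # any in-memory sort works here
            fd, run_path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "wb") as run_file:
                run_file.write(b"".join(records))              # one sorted chunk per file
            run_paths.append(run_path)
    return run_paths
```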
Merge sorted chunks
In 1 pass: merge 256 files into one
In 2 passes: merge the 256 files into 16 files, each of which is 16 times longer, and then merge those 16 longer files into one.
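A quick back-of-the-envelope check of those fan-ins (a sketch; the only inputs are the chunk count and the number of passes):

```python
import math

num_chunks = 256                                   # sorted 100 MB chunks from the example

for passes in (1, 2):
    d = math.ceil(num_chunks ** (1 / passes))      # fan-in needed to finish in this many passes
    print(f"{passes} pass(es): d-way merge with d = {d}")
# 1 pass(es): d-way merge with d = 256
# 2 pass(es): d-way merge with d = 16
```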
How to merge d-lists?
In the example above, d is 256 for the 1-pass merge and 16 for the 2-pass merge. To merge the d lists, we're going to use a heap data structure, which finds the minimum in O(1) time and deletes the minimum or inserts a new item in O(log d) time.
What we want at each step is the minimum of d items, one from each list.

Suppose d is three for illustration, so we'll have a heap that can hold d elements. We've got three files that are sorted, and we want to merge them. Each of them is sorted in increasing order from left to right, for example 2, 6, 7, 11 in the first one.
First, we take the smallest element from each of those files and put it in the heap. The result is a heap with the smallest element, 1, at the top and 2 and 3 below it. That heap sits in memory.

We initially put the smallest element from each input buffer into the heap. Then we repeatedly extract the minimum, put it into the output buffer (which also sits in main memory), replace it with the next element from the list the minimum came from, and heapify again.
We keep doing that until the heap is empty, at which point all d lists have been merged. Along the way, whenever the output buffer becomes full it is flushed out to disk, building up a longer file, so we end up with a single sorted file that is d times as long as the initial chunks.
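A minimal in-memory sketch of that d-way merge using Python's heapq. The first list matches the 2, 6, 7, 11 example above; the other two lists are made up so that the initial heap holds 1, 2, 3:

```python
import heapq

def merge_d_lists(sorted_lists):
    """Merge d sorted lists by repeatedly extracting the heap minimum."""
    heap = []
    for i, lst in enumerate(sorted_lists):
        if lst:
            heapq.heappush(heap, (lst[0], i, 0))          # seed with each list's smallest element
    merged = []                                           # stands in for the output buffer
    while heap:                                           # heap empty => all d lists merged
        value, i, j = heapq.heappop(heap)                 # extract the minimum
        merged.append(value)
        if j + 1 < len(sorted_lists[i]):                  # replace it with the next element from that list
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return merged

print(merge_d_lists([[2, 6, 7, 11], [1, 4, 9, 10], [3, 5, 8, 12]]))
```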
Data access/movement during Merge
d-way merge: we need d input buffers and 1 output buffer. In other words, this refers to the process of merging d sorted lists into one sorted list.
In the context of external sorting, where we're dealing with large amounts of data that do not fit into main memory, we perform this merging process with the help of input buffers and an output buffer.

If the output buffer is full, write it out and append it to the output file.
If an input buffer is empty, read the next chunk of data from that file.
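A sketch of one merge pass driven by explicit buffers, assuming the fixed-size records and run files from the earlier sketch; the buffer size is made up. Input buffers are refilled when they run dry, and the output buffer is flushed whenever it fills:

```python
import heapq

RECORD_SIZE = 100
BUFFER_RECORDS = 10_000                      # records per buffer (illustrative size)

def merge_runs(run_paths, output_path):
    """One d-way merge pass over the given run files using d input buffers and 1 output buffer."""
    files = [open(p, "rb") for p in run_paths]
    buffers = [[] for _ in files]            # d input buffers
    positions = [0] * len(files)             # next unread record in each input buffer
    out_buffer = []                          # 1 output buffer

    def refill(i):
        """Input buffer i is empty: read the next chunk of data from that file."""
        data = files[i].read(BUFFER_RECORDS * RECORD_SIZE)
        buffers[i] = [data[k:k + RECORD_SIZE] for k in range(0, len(data), RECORD_SIZE)]
        positions[i] = 0
        return bool(buffers[i])              # False once the run is exhausted

    heap = []
    for i in range(len(files)):
        if refill(i):                        # seed the heap with each run's smallest record
            heap.append((buffers[i][0], i))
            positions[i] = 1
    heapq.heapify(heap)

    with open(output_path, "wb") as out:
        while heap:
            record, i = heapq.heappop(heap)
            out_buffer.append(record)
            if len(out_buffer) == BUFFER_RECORDS:          # output buffer full: append it to the output file
                out.write(b"".join(out_buffer))
                out_buffer.clear()
            if positions[i] < len(buffers[i]) or refill(i):
                heapq.heappush(heap, (buffers[i][positions[i]], i))
                positions[i] += 1
        out.write(b"".join(out_buffer))                    # flush whatever is left
    for f in files:
        f.close()
```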
Back to Example
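A rough sketch of the trade-off for this example, assuming each pass reads and writes the entire dataset once and memory is split evenly across the d input buffers plus the output buffer:

```python
DATA_GB = 25.6
MEMORY_MB = 100

for passes, fan_in in ((1, 256), (2, 16)):
    io_gb = DATA_GB * 2 * (1 + passes)       # read + write for run formation plus each merge pass
    buffer_mb = MEMORY_MB / (fan_in + 1)     # memory per buffer during a merge pass
    print(f"{passes}-pass merge (d = {fan_in}): ~{io_gb:.1f} GB moved, ~{buffer_mb:.2f} MB per buffer")
```

Fewer passes move less data overall but leave each run with a tiny buffer, while more passes move more data but let each pass read bigger, more sequential chunks from each run.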

Relation to Unix Sort
The Unix sort utility takes the same approach: when its input doesn't fit in the memory it's allowed to use, it writes sorted temporary files (runs) to disk and then merges them, a batch of files at a time, into the final sorted output.

I/O-Efficient Algorithms
I/O-efficient algorithms is an area dealing with the theory and practice of finding algorithms that are efficient when the data sits on disk.
There are algorithms out there for lots of different problems, and they come up in a lot of different areas.

Conclusions
External non-volatile memory (hard disks and SSDs), where we need to store our data, is much slower than main memory.
When the data doesn't fit in main memory, we need to take that into account in algorithm design. A lot of these algorithms have the flavor of scanning, merging, splitting, and sorting large data sets.
There's a trade-off between the chunk size and the number of passes, and finding the optimal balance depends on the specifics of the system and the dataset.