I/O-Efficient Sorting

Sorting is needed in many cases:

  • In a database context, we sometimes want to sort in order to create an index on some data; constructing a B+ tree index involves sorting.

  • We need to sort the output if the query has an ORDER BY.

  • Some join algorithms rely on the data being sorted.

  • One way to eliminate duplicates is to sort the data and drop adjacent equal records.

  • One way to implement GROUP BY is to sort the data on the attribute you're grouping on.

Sorting can be used to remove duplicates by scanning the sorted data, checking whether each record is the same as the previous one, and eliminating it if so. This is a reason why you don't want to always put DISTINCT into your queries if you don't need it: there is an expense involved in sorting the data.
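As a rough illustration (not from the lecture), here is a minimal Python sketch of that idea: once the data is sorted, a single pass comparing each record with the previous one is enough to drop duplicates. The function name `distinct_sorted` is just illustrative.

```python
def distinct_sorted(sorted_records):
    """Yield each record of an already-sorted iterable exactly once.

    Because equal records are adjacent after sorting, comparing each
    record with the previous one is enough to detect duplicates.
    """
    previous = object()  # sentinel that compares unequal to any record
    for record in sorted_records:
        if record != previous:
            yield record
            previous = record


# Duplicates become adjacent once the data is sorted.
data = [3, 1, 2, 3, 1]
print(list(distinct_sorted(sorted(data))))  # [1, 2, 3]
```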

Issues

When we look at performance, we're going to be looking at the I/O cost - the number of disk accesses and transfers.

  • The data might not fit in main memory. If it fits in main memory, you can use whatever your favorite main memory sorting algorithm is.

  • Translating in-memory algorithms to this setting, where the data sits on disk, doesn't always produce something efficient.

  • The one that works well, or has a good analog whether the data is in memory or not, is merge sort.

Instead of dividing into two pieces and sorting them, we're going to divide into a bunch of pieces and then merge them back together - a multi-way merge instead of a two-way merge.

Merge Sort Example

Use d-way merging to merge d sorted files into one larger sorted file, repeating until only one file is left.

Suppose we have a 25.6 GB data file (256 million records of 100 bytes each) and 100 MB of main memory available for sorting.

Sort chunks

  1. Chunk the file up by reading as much as we can fit into memory at a time - 100 megabytes.

  2. Sort that chunk once it's in memory, using any in-memory sorting algorithm.

  3. Write the sorted chunk back out to the disk as a new file.

Repeat those three steps until all the data has been read. The original unsorted data file is now a bunch of sorted chunks sitting on the disk - with 25.6 GB of data and 100 MB chunks, that's 256 of them.
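As a sketch of what this "sort chunks" phase might look like in Python (assuming fixed-size 100-byte records in a binary file; the function and constant names here are illustrative, not from the lecture):

```python
import os
import tempfile

RECORD_SIZE = 100                  # 100-byte records, as in the example
MEMORY_BUDGET = 100 * 1024 * 1024  # roughly 100 MB available for sorting


def create_sorted_runs(input_path):
    """Split the input file into memory-sized chunks, sort each chunk in
    memory, and write it out as its own temporary "run" file.

    Returns the list of run file paths; with 25.6 GB of input and a
    100 MB budget this would produce 256 runs.
    """
    records_per_chunk = MEMORY_BUDGET // RECORD_SIZE
    run_paths = []
    with open(input_path, "rb") as f:
        while True:
            chunk = f.read(records_per_chunk * RECORD_SIZE)
            if not chunk:
                break  # all data has been read
            # Split the chunk into fixed-size records and sort in memory.
            records = [chunk[i:i + RECORD_SIZE]
                       for i in range(0, len(chunk), RECORD_SIZE)]
            records.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "wb") as run_file:
                run_file.writelines(records)
            run_paths.append(path)
    return run_paths
```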

Merge sorted chunks

  • In 1 pass: merge 256 files into one

  • In 2 passes: merge the 256 files into 16 files (16 at a time), each of which is 16 times longer than a chunk, and then take those 16 longer files and merge them into one.

How to merge d-lists?

In the example above, d is 256 in the 1-pass version and 16 in the 2-pass version. To merge the d lists, we're going to use a heap data structure holding d elements: the minimum sits at the top, and deleting the minimum or inserting a new item takes O(log d) time.

What we want to do, repeatedly, is find the minimum of d items.

Suppose d is three for illustration, so we'll have a heap that can hold three elements. We've got three files that are sorted, and we want to merge them. Each of them is sorted in increasing order from left to right - 2, 6, 7, 11, for example, in the first one.

First, we take the smallest element from each of those files and put them in the heap. The result is a heap, sitting in memory, where the smallest element, 1, is at the top and the 2 and the 3 are below it.

So we initially put the smallest element from each input buffer into the heap. Then we repeatedly extract the minimum, put it into the output buffer (which also sits in main memory), replace it in the heap with the next element from the list the minimum came from, and heapify again.

We keep doing that until the heap is empty, at which point all d lists have been merged. Along the way, whenever the output buffer becomes full, we flush it out to disk, building up a longer file. We end up with a single sorted file that is d times as long as each of the initial ones.
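Here is a minimal in-memory sketch of that procedure using Python's `heapq`, with d = 3. Only the first list (2, 6, 7, 11) comes from the example; the other two lists are made up so that their smallest elements are 1 and 3, matching the description above. A real external sort would read from disk-backed input buffers and flush an output buffer instead of using plain lists.

```python
import heapq


def merge_d_lists(sorted_lists):
    """Merge d sorted lists using a heap that holds at most d entries.

    Each heap entry is (value, list index, position), so after popping
    the minimum we can push the next element from the same list.
    """
    heap = []
    # Seed the heap with the smallest element of each list.
    for i, lst in enumerate(sorted_lists):
        if lst:
            heapq.heappush(heap, (lst[0], i, 0))

    merged = []
    while heap:
        value, i, pos = heapq.heappop(heap)  # extract the minimum
        merged.append(value)                 # goes to the output buffer
        if pos + 1 < len(sorted_lists[i]):   # replace it with the next element
            heapq.heappush(heap, (sorted_lists[i][pos + 1], i, pos + 1))
    return merged


# d = 3; the first list is from the example, the other two are assumed.
lists = [[2, 6, 7, 11], [1, 4, 9], [3, 5, 8, 10]]
print(merge_d_lists(lists))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```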

Data access/movement during Merge

d-way merge: need d input buffers and 1 output buffer. In other words, it refers to the process of merging 'd' sorted lists into one sorted list.

In the context of external sorting, where we're dealing with large amounts of data that do not fit into main memory, we perform this merging process with the help of input buffers and an output buffer.

larger d: fewer passes but smaller buffers, thus slower disk I/O

Each of the 'd' sorted lists has an associated input buffer in main memory where a portion of the list can be loaded. An additional output buffer is used to temporarily store the merged elements before they are written back to disk.

When you have a larger 'd' (meaning you merge more lists at once), you'll need fewer passes over the data to complete the sort, as each pass merges more lists together. This can make the sorting process faster overall.

Think of "d" as the number of sorted lists that you are merging at once. The larger the value of "d", the more sorted lists you can merge in a single pass.

For example, let's consider a situation where you have 64 sorted lists to merge.

  1. If your "d" value is 2 (meaning you can merge 2 sorted lists at a time), you'll need to perform 32 merge operations in the first pass to merge all the lists. After the first pass, you would have 32 sorted lists. In the second pass, you would have 16 lists, then 8 lists, then 4 lists, then 2 lists, and then finally 1 list. In total, you've done 6 passes over the data.

  2. Now let's say your "d" value is 4. This means you can merge 4 sorted lists at a time. In the first pass, you would perform 16 merge operations, resulting in 16 sorted lists. In the second pass, you would merge these down to 4 lists, and in the third pass, you would merge these into 1 final sorted list. In this case, you've done only 3 passes over the data.

So, as you can see from this example, a larger "d" value allows you to merge more lists at once, which reduces the total number of passes you need to make over the data to complete the sort. This can potentially make the sorting process faster overall because you're performing fewer passes.
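In general, merging R initial runs d at a time takes ⌈log_d R⌉ passes. A quick sketch to check the numbers above and in the earlier example (the helper name is illustrative):

```python
def merge_passes(num_runs, d):
    """Count the d-way merge passes needed to go from num_runs sorted
    runs down to one (equivalently, ceil(log_d(num_runs)))."""
    passes = 0
    while num_runs > 1:
        num_runs = (num_runs + d - 1) // d  # each pass merges groups of d runs
        passes += 1
    return passes


print(merge_passes(64, 2))     # 6 passes, as in case 1 above
print(merge_passes(64, 4))     # 3 passes, as in case 2 above
print(merge_passes(256, 256))  # 1 pass  (the 25.6 GB example, 1-pass merge)
print(merge_passes(256, 16))   # 2 passes (the 2-pass variant)
```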

However, keep in mind that this speedup comes with a trade-off. The larger your "d" value, the more input buffers you need (since you're merging more lists at once), and since your total amount of available memory is fixed, this means each individual buffer will be smaller. Smaller buffers can potentially slow down the process because they need to be refilled from disk more often, and disk operations are generally much slower than memory operations.

In other words, with a larger 'd', each individual merge operation can handle more data, but the smaller buffer sizes can mean more frequent disk I/O, which is generally slow compared to in-memory operations. The optimal 'd' value usually involves a balance between these two opposing factors.

  • If the output buffer is full, write it out and append it to the output file.

  • If an input buffer is empty, read the next chunk of data from that file (see the sketch below).
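Here is a sketch of how those buffers might be managed in code, assuming the fixed-size records and run files from the earlier run-generation sketch; the buffer size and the class/function names are illustrative:

```python
import heapq

RECORD_SIZE = 100      # fixed-size records, as in the example
BUFFER_RECORDS = 1000  # records per input/output buffer (illustrative)


class RunReader:
    """Wraps one sorted run file with a small in-memory input buffer."""

    def __init__(self, path):
        self.file = open(path, "rb")
        self.buffer = []
        self.pos = 0

    def next_record(self):
        # If the input buffer is empty, read the next chunk from the file.
        if self.pos >= len(self.buffer):
            data = self.file.read(BUFFER_RECORDS * RECORD_SIZE)
            if not data:
                self.file.close()
                return None  # this run is exhausted
            self.buffer = [data[i:i + RECORD_SIZE]
                           for i in range(0, len(data), RECORD_SIZE)]
            self.pos = 0
        record = self.buffer[self.pos]
        self.pos += 1
        return record


def merge_runs(run_paths, output_path):
    """d-way merge of run files using d input buffers and one output buffer."""
    readers = [RunReader(p) for p in run_paths]
    heap = []
    for i, reader in enumerate(readers):
        record = reader.next_record()
        if record is not None:
            heapq.heappush(heap, (record, i))

    output_buffer = []
    with open(output_path, "wb") as out:
        while heap:
            record, i = heapq.heappop(heap)
            output_buffer.append(record)
            # If the output buffer is full, flush it to the output file.
            if len(output_buffer) >= BUFFER_RECORDS:
                out.writelines(output_buffer)
                output_buffer.clear()
            nxt = readers[i].next_record()
            if nxt is not None:
                heapq.heappush(heap, (nxt, i))
        out.writelines(output_buffer)  # flush whatever is left
```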

Back to Example

In each pass we're reading the data in and writing it back out, so the data transferred per pass is double the file size.
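As a rough back-of-the-envelope tally (a sketch, assuming the chunk-sorting step and each merge pass each read and write the whole file once, and ignoring seek costs):

```python
FILE_SIZE_MB = 25_600  # the 25.6 GB data file from the example, in MB


def total_io_mb(num_merge_passes):
    """Rough data volume moved to/from disk: the chunk-sorting pass plus
    each merge pass reads and writes the whole file once, i.e. twice the
    file size per pass (seek costs ignored)."""
    total_passes = 1 + num_merge_passes  # +1 for creating the sorted chunks
    return 2 * FILE_SIZE_MB * total_passes


print(total_io_mb(1))  # 102400 MB (102.4 GB) with the 1-pass merge
print(total_io_mb(2))  # 153600 MB (153.6 GB) with the 2-pass merge
```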

Relation to Unix Sort

The Unix sort utility takes the same approach: when its input doesn't fit in memory, it sorts chunks into temporary files on disk and then merges them.

I/O-Efficient Algorithms

I/O-efficient algorithms is an area dealing with the theory and practice of finding algorithms that are efficient when the data is sitting on a disk.

There are algorithms out there for lots of different problems, and they come up in a lot of different areas.

Conclusions

  • External non-volatile memory (hard disks and SSDs) where we need to store our data is much slower than main memory.

  • When the data doesn't fit in main memory, we need to take that into account in terms of algorithm design. A lot of these algorithms are going to have the flavor of scanning, merging, splitting, sorting large data sets.

  • There's a trade-off between the chunk size and the number of passes, and finding the optimal balance depends on the specifics of the system and the dataset.
