Tree Structures

Binary Search Tree

This version of a binary search tree, mixing together the data and the index so that we don't have a separate file with the data. But instead of the data record for 11 being in the top node having the key value 11, it has also a pointer to where in some file the record with attribute 11 is. In the underlying file, along with the 11, a pointer to a record where this attribute is 11 and it also has some other attributes.

In a binary search tree, we're taking advantage of the fact that there's an ordering of these numbers in order to be able to search through the tree. If I'm searching for the number 12, it is greater than 11, so I go to the right and then I compare 12 to 15 and it is less than 15, so I go to the left and 12 is less than 13, so I go to the left and then I found it at the bottom right. If I'm searching for something that's not in the tree like 7, then I'll do the same thing and I know there's nothing there, so 7 is not in the tree.

B+ Tree

B+ tree has some commonalities with this idea of binary search, but B+ trees that we're going to see are not binary nodes, it will have many children instead of just zero, one or two children. They are also going to be perfectly balanced.

The degree is much greater than two for the trees that we're using in databases.
Keys along with pointers into our data file are all going to be in the leaf nodes. In the internal nodes, we'll just use those to navigate through the tree. In other words, we're going to have values in the internal nodes as well that will help us guide us to the right leaf node, but in every leaf node we're going to have an index entry; a value of the attribute we are indexing and along with a pointer to the data.
That will somehow manage to have the same depth for all of the leaves. In other words, the same number of edges to follow from that leaf up to the root.

B+ trees are going to have to do a little reorganization when we do an insertion or a deletion. Also we'll then have to do some propagating up through the tree all the way to the root. But since our height is guaranteed to be logarithmic in terms of the number of leaf nodes for the number of nodes in general, which is related to the number of the size of the table that we're indexing. These are always going to. Be pretty efficient.

B+ Tree in a Nutshell

Each node is going to occupy a block. Suppose that we choose an N so that we can fit N pointers and N-1 key values into a block. What that is depends on how big the attribute is that we're trying to index. Disk blocks are going to be typically 4KB.

The leaf nodes are going to have key values and pointers, but now the pointers are going to be pointing into the data file and we'll have one exactly pointer for each key value in a leaf. We're going to have one pointer left that's going to be used to chain the bottom, the leaf level of the tree together into a linked list, which is going to be handy for doing range searches.

Example

Compare: Trees with n-elements in leaves

Because these are not binary and they're also balanced, it's going to be logarithmic with a big base. In other words, a small number overall.

Suppose that we have a million( $10^6$ ) values that we want to index.

Binary Tree with N keys:

Let's say we have each internal nodes have 2 pointers and 1 key, which is going to be a $log_2N$ levels.

$log_2(10^6) = log_10(10^6) / log_10(2)=log_2(10^6) = 6 / 0.30103$ = approximately 19.9316.
Therefore, we have 20 levels in our tree.

Degree-100 tree:

Let's say we have each internal nodes have 100 pointers and 99 keys, which is going to be a $log_{100}N$ levels.

Let's make the assumption that each key is 4 bytes and each pointer is 8 bytes (which is common for 64-bit systems). The total bytes for each internal node would then be: (100 pointers * 8 bytes/pointer) + (99 keys * 4 bytes/key) = 800 + 396 = 1196 bytes. So each internal node is approximately 1200 bytes.
$log_{100}(10^6) = 2$ , however, since we are dealing with a tree data structure, we add 1 to include the root node, resulting in 3 levels.

Height and level in a tree structure

A tree with a height of 0 has 1 level (just the root), a tree with a height of 1 has 2 levels (the root and one level of children), and so on. So for a tree with a height of 2, we have 3 levels in total.

Remember that this data is going to be sitting on the disk and we're going to need to read it in and out. So we're going to be doing much better if we have to read 3 nodes of the tree rather than if we have to read 20 of them in terms of real clock time.

Conclusions:

Disk storage systems read and write data in blocks. These blocks are a fixed size, which could be several kilobytes (KB) to several megabytes (MB), depending on the system.
When you want to read data from disk, even if you only need 20 bytes, the disk has to read the entire block that contains those 20 bytes. This is because disk storage systems can't read individual bytes, they can only read whole blocks.
If you're using a binary tree, each node might be very small (perhaps only a few bytes). So, when you read a block from disk to get a single node, you're wasting most of the space in the block.
A B+ tree, on the other hand, has much larger nodes. Each node can contain many keys and pointers, so a single node might fill up an entire block. This means when you read a block from disk to get a node, you're using all the space in the block effectively, which makes the I/O operation more efficient.
The concept of "collapsing a binary tree into smaller levels" refers to how a B+ tree reduces the height of the tree (compared to a binary tree with the same number of elements). This is because each node in a B+ tree has more children, so you can store the same amount of data in fewer levels. This is beneficial because it reduces the number of disk accesses needed to find a particular key, which can significantly improve performance for disk-based storage systems.

In short, B+ trees are an optimization of binary trees that are designed to work efficiently with disk-based storage systems. They take advantage of the fact that disks read and write data in blocks by filling each block with a large amount of useful data (an entire node), and they reduce the number of disk accesses needed to find a key by "collapsing" the binary tree into fewer levels.

Last updated 2 years ago