Primary Index vs. Secondary Index

Clustering Index vs. Non-Clustering Index

Assume that this is a B+ tree because we've got some kind of ordering and that's irrelevant if it was a hash tables.

Primary (Clustering) Indexs

Primary or clustering index is where the relation is sorted according to one particular attribute.

From our leaves we don't have any crisscrosses in those pointers. In our leaf nodes, we're going to have smaller values to the left and larger values to the right as a property of B+ trees. Thus the leaves are going to end up being sorted and we're not going to get crisscrossing the pointers.

  • We're only going to be allowed to have one primary index for each table. The reason is that the data is sorted according to some attributes. Once it's sorted, that's the attribute that you can build a a primary index on. That's not 100% true because we can build a primary index on a pair of attributes and having it sorted. For example, you could have a primary index on XY and in principle a primary index on X as well. But, we'll assume that there's just one per table.

  • It may or may not be the primary key. Let's say I've got a table where the primary key is Social Security Number. But I would like to sort the table according to people's department names. If it's sorted according to department name, I can build an index on department name and that would be the primary index. But sometimes by default an index is built on the primary key.

Advantages

  1. We can do efficient range searches. Let's say that I want to find all the things that are between this value and some other value. Since the data is sorted, I can search for this value and then just start looking in the table until I get up to the top of the range. So I only need to search once in the index.

  1. It allows us to make the index sparse. In other words smaller index size. If we have a primary index, we don't need to create an index item for every single row (record) in the table. We can rely on the fact that the data is sorted to find things that are nearby. Let's say I happen to have 10,20 and 30 indexed, but I'm looking for 15. I could follow the pointer to find 10 and then 15 is going to be nearby.

The idea of "sparse index"

We don't create index entry for each row in table. We can only have pointers to row at start of a data blocks.

Why Primary Indexes?

If the table is already sorted, why not just do a binary search on the table?

Since the table is sorted, you might be tempted to do a binary search on the table. But, it's not all fitting in memory. It's not going to be I/O-Efficient to do binary search. Binary search is working with log base two, whereas B+ Tree index is working with higher fan out which allows us to get more information from a single I/O.

Difference in the base of the logarithm is important in this context.

Let's say every node here have enough space for two key values and three pointers. Each of our nodes is going to have at least two children. That's going to be the thing that allows us to balance it out. At these internal node levels, our pointers are pointing to nodes in the tree and at the leaf level they're pointing into the data file.

B+ Tree

For example, I'm searching for 9, I would say 9 is less than 11, so I go left. It is not less than 8 but it is less than 10. So I go middle. And I get to this leaf node, and we've actually got 2 records that have 9 associated with them. Each of them will have a pointer to a record in the data field that has a 9 in it.

It's not necessarily sorted on this attribute. It's possibly a secondary tree.

Sparse Clustered Index

Do we really need all index entries in Clustered Index?

No, if the data is sorted we can get away without indexing every single value. In other words, I have the opportunity to do a sparse index.

One way to do the sparse index is that for each disk page which is also known as a block have one entry in the data file in which will have a bunch of records.

Because I have fewer index entries and I have fewer values in the index, I may be able to have fewer levels in the tree that we use to build the index if this is a B+ tree index. Fewer, less overflow in the hash table if we're using the hash table. Since I have less blocks that are less nodes that I need to access, I'll have less I/O to get through the index. Typically in the sparse index context, people would have one index entry for each block of data.

We still need to search inside the blocks, but that essentially comes for free because once we read those blocks into memory, everything that we're doing in memory is extremely fast compared to doing an I/O.

Secondary (Non-Clustering) Indexs

In a secondary index, the leaves are still sorted, but now the data is not sorted in this files, so we'll get these pointers crisscrossing around. That's going to have some implications for certain kinds of searches that we can do and for whether we need to index every entry or not.

Last updated