Dense Index vs. Sparse Index

Dense Index Files

In dense index, index record appears for every search-key value in the file.

Example: index on ID attribute of instructor relation

Suppose we're indexing them by their IDs. Our index is going to have IDs and pointers, whereas the data file is going to have the IDs as well as all the other attributes - the last the name, the department, the salaries, etc.

Suppose that only two of these records fit on each block. We would have block one having our first two records, Block 2 has Mozart and Einstein, Block 3 has El Said and Gold, etc. But our index entries are smaller, so we can fit more of them onto a block, so a single block is going to be the frist six IDs for example.

If I was looking for 83821, instead of searching through from Block 1 to Block 6, I'll have two index blocks to look at which will take me to the block where that person Brant has their record.

However, using an index has its trade-offs. While it can speed up data retrieval, it might also introduce additional seeks. For example, you might first seek to the index to find out where the data is and then seek again to the actual data location. In some cases, especially when data is read sequentially, it might be faster to read the data without using an index.

While indexes can speed up data retrieval, they also introduce additional seeks.

Depending on the specific use case and the performance characteristics of the storage device (like seek time vs. transfer time), it might not always be beneficial to use an index. The decision to use an index should be based on a thorough understanding of the data access patterns and the performance characteristics of the underlying storage.

Terminology

Contiguous Data Blocks:

When data is stored contiguously, it means that the data blocks are next to each other on the storage device. This is beneficial because once the read/write head of the storage device reaches the starting block, it can continue reading/writing the subsequent blocks without moving too much.

Seeks:

A "seek" is the action of moving the read/write head of a storage device to a specific location. Seeks are relatively slow operations because they involve physical movement. The time taken for a seek can vary based on how far the head needs to move.

Transfers:

Once the head is in the right position (after a seek), transferring data (reading or writing) is faster than seeking. However, it's still slower than operations in main memory (like RAM).

Latency Model:

This is a model used to evaluate the performance of algorithms based on the time it takes to access data. In this context, the latency refers to the delay in accessing data from storage. The main contributors to this latency are seeks and transfers.

So one thing that will help is, if I could have sparsity instead, that's going to make my index a little bit smaller.

Example: multiple records that have the same value for the attribute

If multiple records have the same value for an indexed attribute, then you can achieve savings in the size of the index. Instead of having a separate index entry for each record, you can have a single index entry pointing to the first occurrence of a particular value, and then scan sequentially from there to find all records with that value. This reduces the size of the index compared to the size of the data file.

Notice that this is a "dense primary index". A primary index is an index on a set of fields that includes the primary key. Since the data is sorted by the department name, and the index is also on the department name, it's a dense primary index.

In summary, the main point is about the efficiency of dense indexes, especially when multiple records share the same value for an indexed attribute. In such cases, the index can be much smaller than the data file, leading to savings in storage and potentially faster query performance.

Spare Index Files

Spare index contains index records for only some search-key values.

  • Applicable when records are sequentially ordered on search-key

  • To locate a record with search-key value K we:

    • Find index record with largest search-key valye < K

    • Search file sequentially starting at the record to which the index record points

Example

Suppose that I'm searching for 22222. There isn't an entry for 22222. It's greater than equal to 10101 and less than 32343. I'll follow the record ID pointer that goes with the largest index entry that's less than or equal to the ID that I'm searching for - 10101.

Within the data block, records are sorted. So if I just start looking sequentially from 10101, I will eventually get to what I'm looking for, 22222. The block that I fetched to search for 22222 would be one block: from 10101 to 22222

Once the next block (from 32343 to 76543) is in memory, I look through it quickly because it's in memory, but I don't care about what sorting algorithms to use like a binary search or a sequential search since once it's in memory, it's very fast compared to the I/O.

Last updated