Index Structures

Various types of indexes

1. Clustered (Primary) or Unclustered (Secondary)

This refers to whether the data in the underlying table is sorted according to the attribute that we're in the indexing.

2. Sparse or Dense

That has to do with whether we have a index entry for every single record in the table, or in the case of dense if we only have index entries for some of them it's sparse.

If it's dense, every record in the data file or every value of the attribute in question that we're looking for such as Social Security Number in the data file is listed in the index.
Sparse means that we don't list all of the values. But we have some way of finding the unlisted values based on some of the ones that are listed in the index.

3. Single Key or Composite or Multi-Dimensional

Whether we're looking at just a single attribute in our index say, is it just an index on ages of people or is it on multiple attributes we could index on, say age and zip code. If we expect to have a lot of queries where we're looking for people with a particular age within some range and a ZIP code.

There are contexts such as geographical information systems where we might have two or higher dimensional data sets for example. We might want efficient ways of finding a point on a plane or a city on map. However, we're not covering for such multidimensional data here.

Different data structures to implement indexes

Tree Structures

They're a class of ordered indices in which we're taking advantage of ordering on the data that. For example, one age is less than another age numerically, or one name is less than another name in alphabetical order.

Hashing

Different family of indexes that does not rely on the order among elements is a hash table. We don't use any kind of comparison and we are only looking for equality among things.

Index entries

An index entry is a pair of key and pointer

Keys could be a value of some tuple of attributes like age and zip code that we're going to be searching for. Pointers are telling us which block on the disk that the record for that lives in. Actually the pointers might not be a physical pointer like that. There's usually some more indirection that we'll have some kind of an identifier called record ID telling us how to find the record on the disk that has the particular key value that we're looking for. Every object in our database is going to have a record ID.

Index Construction

Make an index entry for each tuple in the table
Insert index entries into index structures (B+ tree, Hash Table)

The index is going to be based on the values that are in the table at the time that we build the index. So we'll have an entry for every element that's in the table. If we do insertions into the table, we want to update the index as well, and if we do deletions from the table we want to update the index.

Example: schema is (age, name) with index on age

For example, we'll have the record (8, Tom) and we're going to convert into a an index entry that has the key 8 and a pointer to the record for (8, Tom).
We'll take all of those index entires and put them into some kind of an index structure.

In this kind of structure was assuming that this works if our key values are unique. If there were Social Security Numbers, that would be a realistic assumption.

Sneak Peek at Query Processing

How much data we're reading in?

The data is in a table on the disk and the indices are in some blocks on the disk, but we need to read everything into main memory or RAM before we can work.

Note: block is a smallest unit of data transferred often 4K bytes (8 sectors).

SELECT fname, lname FROM Person WHERE ID = 12345

Now, assume that the data table is stored on the disk in contiguous blocks and within the data file, various file layouts are possible.

If we're not using an index, we would have one random access (seek time) plus rotational latency to find the first block. If there's blocks of data in the Person table, then we will need to transfer all of blocks.

Basic Setup

We're not going to be reorganizing the data as part of building the index.

Types of Data Structures Used

We're just separating out the index leafs at the end if it's a B+ tree, if it was a hash table, it would be the key values and pointers in the buckets of the hash table. But we actually need another level of indirection before we can get to the data in the case that we have multiple values for an attribute.

For example, we're looking for the name of a person who has a particular Social Security Number. The Social Security number would be used to navigate our way through the index, and then we'll go down to the actual data table (db relation) to find that record and find the person's name.

In which case the picture would still look a lot like above, but beneath the leaves we would have something like where we'd have a bucket of pointers to everybody whose age is 6.

Index vs. File Organization:

What we're going to be looking at is an index organization that's separate from the file.

You could use it either as an index where the data is the actual data with the other attributes in each record is stored in a separate file, or you could also use this as a way of organizing the actual data table file.

Index organization where our data table is sitting on the disk in a bunch of contiguous blocks. Index is typically scattered around in some other blocks on the disk. We have to get down to navigate through the index to the leaf level and follow a pointer to the data that we're looking for.
The other option is to actually have the data, in other words, the other attributes as well as the ones that we are keying in living in the tree along with index for navigating to find the right leaf.

Where is the index stored?

If the index is small enough, it could be stored in memory. But the realistic examples, the index is also going to be stored on the disk. Furthermore, it's going to be spread out over the disk where blocks that are not contiguous.

Index Evaluation Metrics

Access time
Insertion time
Deletion time
Space overhead

Last updated 2 years ago