Hash Table

A hash table, also known as a hash map, is a data structure that implements an associative array abstract data type (ADT) that can map keys to values.

The most notable feature of the hash table is that most of its operations have a time complexity of O(1) according to amortized analysis.
The use of a hash function to index the hash table is called hashing.
Hashing is an essential technique used to store and retrieve information as quickly as possible.

Load Factor

The Load Factor is the ratio of the number of data items stored in the hash table to the number of buckets $k$ .

\text{load factor}=\frac{n}{k}

Generally, as the Load Factor increases, the performance of the hash table tends to decrease.

The Load Factor ratio determines whether to rewrite the hash function or adjust the size of the hash table. This value is also used to measure the efficiency of how well the hash function distributes keys.

In Java 10, the default Load Factor for a hashmap is set to 0.75.
Python's Load Factor is 0.66, which is smaller than Java, while Ruby's is even smaller at 0.5.

Hash Function

A hash function is a function that maps data of arbitrary size to a fixed-size value.

Let's take a look at the Modulo-Division Method, one of the most straightforward and widely used hashing algorithms.

The Modulo-Division Method is a simple hashing technique where the hash value is calculated by taking the modulus of the key with the table size. In mathematical terms, the hash function can be expressed as:

ℎ(k)=k\mod {m}

where $h(k)$ is the hash value, $k$ is the key, and $m$ is the size of the table or the number of buckets. In this context, $k$ represents a sufficiently random value of the key, which is produced through some simple rules.

The goal is to ensure that the key $k$ is distributed uniformly across the range to minimize the chances of collisions in the hash table. For example, if we're hashing strings, a simple rule could involve converting each character in the string to its ASCII value, then summing these values to produce a "random" key $k$ . This key $k$ can then be used with the Modulo-Division Method to find an appropriate slot in the hash table.

However, the process to generate $k$ should be chosen carefully. If the method to derive $k$ is not well-designed, it could result in an uneven distribution, leading to more collisions and reducing the efficiency of the hash table.

The primary advantage of the Modulo-Division Method is its simplicity and ease of implementation. However, the choice of $m$ is crucial. If $m$ is not chosen properly, it can lead to a high number of collisions. Generally, it's recommended to choose $m$ as a prime number that isn't too close to a power of 2 or 10 to achieve a good distribution of keys and reduce collisions.

Collision

Collisions occur more often than one might think, so it's crucial to minimize them. This is because multiple collisions require additional operations.

No matter how good a hash function is, collisions will occur. Let's take a look at how we handle collisions when they arise.

Seperate Chaining

When a collision occurs, it is handled by linking with a linked list as illustrated in this diagram below:

Compute the hash value of the key.
Use the hash value to determine the array index.
If the same index exists, link it with a linked list.

Open Addressing

Open addressing is a method where, upon collision, probing is used to find an empty space.

In the open addressing method, no more than the total number of slots can be stored. Therefore, when a collision occurs, there's no guarantee that all elements will necessarily be stored in an address that matches their hash value since it resolves the collision by finding an empty space within the table space using probing.
For example, in linear probing, clustering occurs, which means there are groups of continuous data scattered around the hash table. This phenomenon makes the probing time longer, reducing the overall hashing efficiency.
Therefore, when it fills up beyond a certain point, or exceeds the benchmark load factor, rehashing is done based on the growth factor. A new, larger bucket is created, and data is copied over. This process is similar to how dynamic arrays copy and move data to a larger array when the current array is full.

Comparison of Hash Table Implementation by Language

Python's dictionary is a data type implemented as a hash table, using the open addressing method to resolve hash table collisions.

Why open addressing?

Creating a linked list requires additional memory allocation, and memory allocation is a relatively slow operation.
Linear probing, one method of open addressing, typically has better performance than chaining. However, when more than 80% of the slots are filled, there is a sharp performance degradation. Unlike chaining, with open addressing, you cannot store more than the total number of available slots, i.e., a load factor greater than 1.

Hash table implementation for each language

Modern languages opt for the open addressing method to enhance performance, but they set a smaller load factor to mitigate performance drop issues.

Language

Implementation

C++(GC libstdc++)

Seperate Chaining

Java

Seperate Chaining

Ruby

Open Addressing

Python

Open Addressing

Last updated 2 years ago