Python Hashtable: How to Efficiently Use a Hashtable

[

Build a Hash Table in Python With TDD

by Bartosz Zaczyński

Get to Know the Hash Table Data Structure

Hash Table vs Dictionary
Hash Table: An Array With a Hash Function

Understand the Hash Function

Examine Python’s Built-in hash()
Dive Deeper Into Python’s hash()
Identify Hash Function Properties
Compare an Object’s Identity With Its Hash
Make Your Own Hash Function

Build a Hash Table Prototype in Python With TDD

Take a Crash Course in Test-Driven Development
Define a Custom HashTable Class
Insert a Key-Value Pair
Find a Value by Key
Delete a Key-Value Pair
Update the Value of an Existing Pair
Get the Key-Value Pairs
Use Defensive Copying
Get the Keys and Values
Report the Hash Table’s Length
Make the Hash Table Iterable
Represent the Hash Table in Text
Test the Equality of Hash Tables

Resolve Hash Code Collisions

Find Collided Keys Through Linear Probing
Use Linear Probing in the HashTable Class
Let the Hash Table Resize Automatically
Calculate the Load Factor
Isolate Collided Keys With Separate Chaining

Retain Insertion Order in a Hash Table

Use Hashable Keys

Hashability vs Immutability
The Hash-Equal Contract

Conclusion

Invented over half a century ago, the hash table is a classic data structure that has been fundamental to programming. To this day, it helps solve many real-life problems, such as indexing database tables, caching computed values, or implementing sets. It often comes up in job interviews, and Python uses hash tables all over the place to make name lookups almost instantaneous.

Even though Python comes with its own hash table called dict, it can be helpful to understand how hash tables work behind the curtain. A coding assessment may even task you with building one. This tutorial will walk you through the steps of implementing a hash table from scratch as if there were none in Python. Along the way, you’ll face a few challenges that’ll introduce important concepts and give you an idea of why hash tables are so fast.

In addition to this, you’ll get a hands-on crash course in test-driven development (TDD) and will actively practice it while building your hash table in a step-by-step fashion. You’re not required to have any prior experience with TDD, but at the same time, you won’t get bored even if you do!

In this tutorial, you’ll learn:

How a hash table differs from a dictionary
How you can implement a hash table from scratch in Python
How you can deal with hash collisions and other challenges
What the desired properties of a hash function are
How Python’s hash() works behind the scenes

It’ll help if you’re already familiar with Python dictionaries and have basic knowledge of object-oriented programming.

Now, let’s dive into the tutorial!

Get to Know the Hash Table Data Structure

Hash Table vs Dictionary

A hash table and a dictionary are similar in concept: both store data in key-value pairs. However, there are a few differences to keep in mind.

In Python, the built-in dict type is implemented as a hash table. It is highly optimized for performance and provides constant-time complexity for key lookups, insertions, and deletions on average. You can think of it as a black box that takes in keys and returns corresponding values.

On the other hand, a hash table is a more generic data structure that is not specific to any programming language. It is a collection of key-value pairs where keys are hashed to produce indices of an array, which is why access operations can be performed in constant time.

Hash Table: An Array With a Hash Function

A hash table is essentially an array of fixed size, where each element can be accessed using a unique key. The core idea behind a hash table is to use a hash function to map keys to indices in the underlying array. The hash function takes in a key and returns an integer, which is then used as an array index.

The goal is to minimize collisions, which occur when two or more keys produce the same hash value. However, collisions are inevitable due to the finite size of the array. There are different techniques to handle collisions, such as separate chaining and linear probing, which we will explore later in this tutorial.

Understand the Hash Function

Examine Python’s Built-in hash()

Python’s built-in hash() function can be used to compute the hash value of any object. It returns an integer representing the object’s hash value.

# Compute the hash value of an object
hash_value = hash(object)

The hash() function relies on the __hash__() method of the object being hashed. This method is responsible for generating a unique hash value based on the content of the object.

Dive Deeper Into Python’s hash()

The default implementation of the __hash__() method in Python covers most objects. However, it can be overridden for custom objects to ensure that objects with the same content have the same hash value. This is important for hash tables to function correctly.

# Override the __hash__() method for a custom object
def __hash__(self):
    return hash((self.attr1, self.attr2))

By default, objects of built-in types, such as integers, floats, strings, and tuples, are hashable. This means their hash values are computed based on their contents, and objects with the same contents will have the same hash value.

Identify Hash Function Properties

A good hash function for a hash table should have the following properties:

Deterministic: Given the same input, the hash function should always produce the same hash value.
Uniform distribution: The hash values should be uniformly distributed across a wide range. This reduces collisions and improves performance.
Fast computation: The hash function should be computationally cheap to calculate.

It’s important to note that the hash function doesn’t need to be perfectly unique for every object. Collisions are acceptable as long as they are rare and efficiently handled.

Compare an Object’s Identity With Its Hash

In Python, you can compare an object’s identity with its hash value using the id() function and the hash() function. The id() function returns a unique identifier for an object, while the hash() function returns the object’s hash value.

# Compare an object's identity and hash value
object = "example"
print(id(object))
print(hash(object))

If two objects have the same hash value, it doesn’t necessarily mean they are the same object. It’s possible to have hash collisions, where different objects produce the same hash value. However, if two objects have different hash values, they are guaranteed to be different objects.

Make Your Own Hash Function

You can create your own hash function by implementing the __hash__() method for a custom class. The __hash__() method should return an integer representing the hash value of the object.

# Implementing a custom hash function
class MyClass:
    def __init__(self, attr1, attr2):
        self.attr1 = attr1
        self.attr2 = attr2

    def __hash__(self):
        return hash((self.attr1, self.attr2))

In this example, the hash() function is used to compute the hash value based on the attributes attr1 and attr2 of the MyClass objects. This ensures that objects with the same attribute values will have the same hash value.

Build a Hash Table Prototype in Python With TDD

In this section, you will learn how to build a prototype of a hash table in Python using test-driven development (TDD) techniques. Test-driven development involves writing tests before writing the actual implementation code.

Take a Crash Course in Test-Driven Development

Test-driven development (TDD) is a development technique that involves writing tests before writing the implementation code. This helps ensure that the code meets the expected requirements and provides a safety net for refactoring.

To get started with TDD, you need to define the requirements of the hash table and write tests to verify that the implementation meets those requirements.

Define a Custom HashTable Class

First, you need to define a custom HashTable class that will represent the hash table. This class will have methods for inserting, finding, deleting, updating, and retrieving key-value pairs from the hash table.

class HashTable:
    def __init__(self):
        # Initialize the hash table with an array and size
        pass

    def insert(self, key, value):
        # Insert a key-value pair into the hash table
        pass

    def find(self, key):
        # Find a value by key in the hash table
        pass

    def delete(self, key):
        # Delete a key-value pair from the hash table
        pass

    def update(self, key, value):
        # Update the value of an existing key-value pair in the hash table
        pass

    def get_pairs(self):
        # Get all key-value pairs from the hash table
        pass

    def get_keys(self):
        # Get all keys from the hash table
        pass

    def get_values(self):
        # Get all values from the hash table
        pass

    def report_length(self):
        # Report the length of the hash table
        pass

    def make_iterable(self):
        # Make the hash table iterable
        pass

    def represent_text(self):
        # Represent the hash table in text format
        pass

    def test_equality(self, other):
        # Test the equality of two hash tables
        pass

The HashTable class has empty method stubs for each functionality that will be implemented later.

Insert a Key-Value Pair

The insert() method will allow you to insert a key-value pair into the hash table. It will use the hash function to determine the index in the array where the key-value pair will be stored.

def insert(self, key, value):
    index = self.hash_function(key)
    # Insert the key-value pair at the calculated index
    # Handle collisions if necessary

The hash_function() method will take a key as input and return the hash value.

Find a Value by Key

The find() method will allow you to find a value by key in the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.

def find(self, key):
    index = self.hash_function(key)
    # Search for the key-value pair at the calculated index
    # Handle collisions if necessary
    # Return the value if found, or raise an exception if not found

The hash_function() method will be the same as used in the insert() method.

Delete a Key-Value Pair

The delete() method will allow you to delete a key-value pair from the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.

def delete(self, key):
    index = self.hash_function(key)
    # Delete the key-value pair at the calculated index
    # Handle collisions if necessary
    # Raise an exception if the key-value pair is not found

The hash_function() method will be the same as used in the insert() and find() methods.

Update the Value of an Existing Pair

The update() method will allow you to update the value of an existing key-value pair in the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.

def update(self, key, value):
    index = self.hash_function(key)
    # Find the key-value pair at the calculated index
    # Update the value if found, or raise an exception if not found

The hash_function() method will be the same as used in the insert(), find(), and delete() methods.

Get the Key-Value Pairs

The get_pairs() method will allow you to retrieve all key-value pairs from the hash table. It will return a list of tuples, where each tuple represents a key-value pair.

def get_pairs(self):
    # Return a list of all key-value pairs in the hash table

Use Defensive Copying

To protect the integrity of the hash table, it is important to use defensive copying when returning the key-value pairs, keys, or values from the hash table. This will prevent modifications to the internal data structures of the hash table.

Get the Keys and Values

The get_keys() and get_values() methods will allow you to retrieve all keys and values from the hash table, respectively. They will return lists of keys and values.

def get_keys(self):
    # Return a list of all keys in the hash table

def get_values(self):
    # Return a list of all values in the hash table

Report the Hash Table’s Length

The report_length() method will allow you to report the length of the hash table, which is the number of key-value pairs stored in it.

def report_length(self):
    # Return the length of the hash table

Make the Hash Table Iterable

The make_iterable() method will allow you to make the hash table iterable, so that you can iterate over key-value pairs using a for loop.

def make_iterable(self):
    # Make the hash table iterable

Represent the Hash Table in Text

The represent_text() method will allow you to represent the hash table in a human-readable text format.

def represent_text(self):
    # Return a string representation of the hash table

Test the Equality of Hash Tables

The test_equality() method will allow you to test the equality of two hash tables. It will return True if the hash tables have the same key-value pairs and False otherwise.

def test_equality(self, other):
    # Test the equality of two hash tables

Resolve Hash Code Collisions

Find Collided Keys Through Linear Probing

Linear probing is a technique to resolve hash code collisions in a hash table. When a collision occurs, instead of storing the key-value pair at the calculated index, you sequentially search for the next available empty slot in the array until you find one.

def insert(self, key, value):
    index = self.hash_function(key)
    # Search for the next available slot after the calculated index
    # Insert the key-value pair at the found slot
    # Handle collisions if necessary

When searching for an available slot, you can use a loop that increments the index by 1 until it finds an empty slot in the array.

Use Linear Probing in the HashTable Class

To implement linear probing in the HashTable class, you can modify the insert() method to handle collisions and search for the next available slot in the array.

def insert(self, key, value):
    index = self.hash_function(key)
    while self.array[index] is not None:
        # Move to the next index until an empty slot is found
    # Insert the key-value pair at the found slot

You will need to update the insert() method to handle collisions properly and implement the logic to search for the next available slot in the array.

Let the Hash Table Resize Automatically

To handle collisions and maintain a good performance, you can let the hash table resize automatically when the load factor exceeds a certain threshold. The load factor represents the fill level of the hash table and is calculated as the number of key-value pairs divided by the size of the array.

To resize the hash table, you can create a new array with a larger size and rehash all the key-value pairs into the new array.

Calculate the Load Factor

The calculate_load_factor() method can be implemented to calculate the load factor of the hash table.

def calculate_load_factor(self):
    # Calculate the load factor of the hash table

The load factor is calculated as the number of key-value pairs divided by the size of the array.

Isolate Collided Keys With Separate Chaining

Separate chaining is another technique to resolve hash code collisions in a hash table. Instead of storing the key-value pairs directly in the array, a separate linked list is created for each index where collisions occur. This way, each linked list represents a group of collided keys.

def insert(self, key, value):
    index = self.hash_function(key)
    # Insert the key-value pair at the calculated index
    # Handle collisions by appending the key-value pair to the linked list

When a collision occurs, you can create a new node for the current key-value pair and append it to the linked list.

Retain Insertion Order in a Hash Table

To retain the insertion order of key-value pairs in a hash table, you can use an additional linked list or array to keep track of the order in which the key-value pairs are inserted. This way, you can iterate over the key-value pairs in the order they were inserted.

Use Hashable Keys

Hashability vs Immutability

In order for keys to be used in a hash table, they must be hashable. This means they must have a defined __hash__() method that returns a unique hash value based on the key’s content. In addition, hashable keys should be immutable, meaning that their content cannot be changed after they are created.

The Hash-Equal Contract

Hashable keys must also follow the hash-equal contract, which states that two objects that are equal must have the same hash value. However, two objects with the same hash value are not necessarily equal.

To ensure that custom keys are hashable, you need to implement the __hash__() method for the key class. This method should return a hash value that is based on the content of the key.

Conclusion

In this tutorial, you have learned the basic concepts and implementation details of a hash table in Python. You have seen how a hash table differs from a dictionary, how the hash function works, and how to implement a hash table using test-driven development techniques.

You have also explored techniques to resolve hash code collisions, including linear probing and separate chaining. Additionally, you have learned about the importance of hashable keys and the hash-equal contract.

By understanding the inner workings of a hash table, you have gained insights into how Python’s dict type and other hash table-based data structures work behind the scenes. This knowledge can be valuable in solving real-life programming problems, as well as in technical interviews.

Now that you have a strong foundation in hash tables, you are ready to explore more advanced topics, such as hash table optimizations, handling large data sets, and integrating hash tables with other data structures.

Happy coding!