Python Hashtable: How to Efficiently Use a Hashtable
Build a Hash Table in Python With TDD
by Bartosz Zaczyński
Get to Know the Hash Table Data Structure
- Hash Table vs Dictionary
- Hash Table: An Array With a Hash Function
Understand the Hash Function
- Examine Python’s Built-in hash()
- Dive Deeper Into Python’s hash()
- Identify Hash Function Properties
- Compare an Object’s Identity With Its Hash
- Make Your Own Hash Function
Build a Hash Table Prototype in Python With TDD
- Take a Crash Course in Test-Driven Development
- Define a Custom HashTable Class
- Insert a Key-Value Pair
- Find a Value by Key
- Delete a Key-Value Pair
- Update the Value of an Existing Pair
- Get the Key-Value Pairs
- Use Defensive Copying
- Get the Keys and Values
- Report the Hash Table’s Length
- Make the Hash Table Iterable
- Represent the Hash Table in Text
- Test the Equality of Hash Tables
Resolve Hash Code Collisions
- Find Collided Keys Through Linear Probing
- Use Linear Probing in the HashTable Class
- Let the Hash Table Resize Automatically
- Calculate the Load Factor
- Isolate Collided Keys With Separate Chaining
Retain Insertion Order in a Hash Table
Use Hashable Keys
- Hashability vs Immutability
- The Hash-Equal Contract
Conclusion
Invented over half a century ago, the hash table is a classic data structure that has been fundamental to programming. To this day, it helps solve many real-life problems, such as indexing database tables, caching computed values, or implementing sets. It often comes up in job interviews, and Python uses hash tables all over the place to make name lookups almost instantaneous.
Even though Python comes with its own hash table called dict
, it can be helpful to understand how hash tables work behind the curtain. A coding assessment may even task you with building one. This tutorial will walk you through the steps of implementing a hash table from scratch as if there were none in Python. Along the way, you’ll face a few challenges that’ll introduce important concepts and give you an idea of why hash tables are so fast.
In addition to this, you’ll get a hands-on crash course in test-driven development (TDD) and will actively practice it while building your hash table in a step-by-step fashion. You’re not required to have any prior experience with TDD, but at the same time, you won’t get bored even if you do!
In this tutorial, you’ll learn:
- How a hash table differs from a dictionary
- How you can implement a hash table from scratch in Python
- How you can deal with hash collisions and other challenges
- What the desired properties of a hash function are
- How Python’s
hash()
works behind the scenes
It’ll help if you’re already familiar with Python dictionaries and have basic knowledge of object-oriented programming.
Now, let’s dive into the tutorial!
Get to Know the Hash Table Data Structure
Hash Table vs Dictionary
A hash table and a dictionary are similar in concept: both store data in key-value pairs. However, there are a few differences to keep in mind.
In Python, the built-in dict
type is implemented as a hash table. It is highly optimized for performance and provides constant-time complexity for key lookups, insertions, and deletions on average. You can think of it as a black box that takes in keys and returns corresponding values.
On the other hand, a hash table is a more generic data structure that is not specific to any programming language. It is a collection of key-value pairs where keys are hashed to produce indices of an array, which is why access operations can be performed in constant time.
Hash Table: An Array With a Hash Function
A hash table is essentially an array of fixed size, where each element can be accessed using a unique key. The core idea behind a hash table is to use a hash function to map keys to indices in the underlying array. The hash function takes in a key and returns an integer, which is then used as an array index.
The goal is to minimize collisions, which occur when two or more keys produce the same hash value. However, collisions are inevitable due to the finite size of the array. There are different techniques to handle collisions, such as separate chaining and linear probing, which we will explore later in this tutorial.
Understand the Hash Function
Examine Python’s Built-in hash()
Python’s built-in hash()
function can be used to compute the hash value of any object. It returns an integer representing the object’s hash value.
The hash()
function relies on the __hash__()
method of the object being hashed. This method is responsible for generating a unique hash value based on the content of the object.
Dive Deeper Into Python’s hash()
The default implementation of the __hash__()
method in Python covers most objects. However, it can be overridden for custom objects to ensure that objects with the same content have the same hash value. This is important for hash tables to function correctly.
By default, objects of built-in types, such as integers, floats, strings, and tuples, are hashable. This means their hash values are computed based on their contents, and objects with the same contents will have the same hash value.
Identify Hash Function Properties
A good hash function for a hash table should have the following properties:
- Deterministic: Given the same input, the hash function should always produce the same hash value.
- Uniform distribution: The hash values should be uniformly distributed across a wide range. This reduces collisions and improves performance.
- Fast computation: The hash function should be computationally cheap to calculate.
It’s important to note that the hash function doesn’t need to be perfectly unique for every object. Collisions are acceptable as long as they are rare and efficiently handled.
Compare an Object’s Identity With Its Hash
In Python, you can compare an object’s identity with its hash value using the id()
function and the hash()
function. The id()
function returns a unique identifier for an object, while the hash()
function returns the object’s hash value.
If two objects have the same hash value, it doesn’t necessarily mean they are the same object. It’s possible to have hash collisions, where different objects produce the same hash value. However, if two objects have different hash values, they are guaranteed to be different objects.
Make Your Own Hash Function
You can create your own hash function by implementing the __hash__()
method for a custom class. The __hash__()
method should return an integer representing the hash value of the object.
In this example, the hash()
function is used to compute the hash value based on the attributes attr1
and attr2
of the MyClass
objects. This ensures that objects with the same attribute values will have the same hash value.
Build a Hash Table Prototype in Python With TDD
In this section, you will learn how to build a prototype of a hash table in Python using test-driven development (TDD) techniques. Test-driven development involves writing tests before writing the actual implementation code.
Take a Crash Course in Test-Driven Development
Test-driven development (TDD) is a development technique that involves writing tests before writing the implementation code. This helps ensure that the code meets the expected requirements and provides a safety net for refactoring.
To get started with TDD, you need to define the requirements of the hash table and write tests to verify that the implementation meets those requirements.
Define a Custom HashTable Class
First, you need to define a custom HashTable
class that will represent the hash table. This class will have methods for inserting, finding, deleting, updating, and retrieving key-value pairs from the hash table.
The HashTable
class has empty method stubs for each functionality that will be implemented later.
Insert a Key-Value Pair
The insert()
method will allow you to insert a key-value pair into the hash table. It will use the hash function to determine the index in the array where the key-value pair will be stored.
The hash_function()
method will take a key as input and return the hash value.
Find a Value by Key
The find()
method will allow you to find a value by key in the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.
The hash_function()
method will be the same as used in the insert()
method.
Delete a Key-Value Pair
The delete()
method will allow you to delete a key-value pair from the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.
The hash_function()
method will be the same as used in the insert()
and find()
methods.
Update the Value of an Existing Pair
The update()
method will allow you to update the value of an existing key-value pair in the hash table. It will use the hash function to determine the index in the array where the key-value pair might be stored.
The hash_function()
method will be the same as used in the insert()
, find()
, and delete()
methods.
Get the Key-Value Pairs
The get_pairs()
method will allow you to retrieve all key-value pairs from the hash table. It will return a list of tuples, where each tuple represents a key-value pair.
Use Defensive Copying
To protect the integrity of the hash table, it is important to use defensive copying when returning the key-value pairs, keys, or values from the hash table. This will prevent modifications to the internal data structures of the hash table.
Get the Keys and Values
The get_keys()
and get_values()
methods will allow you to retrieve all keys and values from the hash table, respectively. They will return lists of keys and values.
Report the Hash Table’s Length
The report_length()
method will allow you to report the length of the hash table, which is the number of key-value pairs stored in it.
Make the Hash Table Iterable
The make_iterable()
method will allow you to make the hash table iterable, so that you can iterate over key-value pairs using a for
loop.
Represent the Hash Table in Text
The represent_text()
method will allow you to represent the hash table in a human-readable text format.
Test the Equality of Hash Tables
The test_equality()
method will allow you to test the equality of two hash tables. It will return True
if the hash tables have the same key-value pairs and False
otherwise.
Resolve Hash Code Collisions
Find Collided Keys Through Linear Probing
Linear probing is a technique to resolve hash code collisions in a hash table. When a collision occurs, instead of storing the key-value pair at the calculated index, you sequentially search for the next available empty slot in the array until you find one.
When searching for an available slot, you can use a loop that increments the index by 1 until it finds an empty slot in the array.
Use Linear Probing in the HashTable Class
To implement linear probing in the HashTable
class, you can modify the insert()
method to handle collisions and search for the next available slot in the array.
You will need to update the insert()
method to handle collisions properly and implement the logic to search for the next available slot in the array.
Let the Hash Table Resize Automatically
To handle collisions and maintain a good performance, you can let the hash table resize automatically when the load factor exceeds a certain threshold. The load factor represents the fill level of the hash table and is calculated as the number of key-value pairs divided by the size of the array.
To resize the hash table, you can create a new array with a larger size and rehash all the key-value pairs into the new array.
Calculate the Load Factor
The calculate_load_factor()
method can be implemented to calculate the load factor of the hash table.
The load factor is calculated as the number of key-value pairs divided by the size of the array.
Isolate Collided Keys With Separate Chaining
Separate chaining is another technique to resolve hash code collisions in a hash table. Instead of storing the key-value pairs directly in the array, a separate linked list is created for each index where collisions occur. This way, each linked list represents a group of collided keys.
When a collision occurs, you can create a new node for the current key-value pair and append it to the linked list.
Retain Insertion Order in a Hash Table
To retain the insertion order of key-value pairs in a hash table, you can use an additional linked list or array to keep track of the order in which the key-value pairs are inserted. This way, you can iterate over the key-value pairs in the order they were inserted.
Use Hashable Keys
Hashability vs Immutability
In order for keys to be used in a hash table, they must be hashable. This means they must have a defined __hash__()
method that returns a unique hash value based on the key’s content. In addition, hashable keys should be immutable, meaning that their content cannot be changed after they are created.
The Hash-Equal Contract
Hashable keys must also follow the hash-equal contract, which states that two objects that are equal must have the same hash value. However, two objects with the same hash value are not necessarily equal.
To ensure that custom keys are hashable, you need to implement the __hash__()
method for the key class. This method should return a hash value that is based on the content of the key.
Conclusion
In this tutorial, you have learned the basic concepts and implementation details of a hash table in Python. You have seen how a hash table differs from a dictionary, how the hash function works, and how to implement a hash table using test-driven development techniques.
You have also explored techniques to resolve hash code collisions, including linear probing and separate chaining. Additionally, you have learned about the importance of hashable keys and the hash-equal contract.
By understanding the inner workings of a hash table, you have gained insights into how Python’s dict
type and other hash table-based data structures work behind the scenes. This knowledge can be valuable in solving real-life programming problems, as well as in technical interviews.
Now that you have a strong foundation in hash tables, you are ready to explore more advanced topics, such as hash table optimizations, handling large data sets, and integrating hash tables with other data structures.
Happy coding!