Consistent hashing
Theory
Distributed Hash Tables (DHTs)
A hash table is a fundamental data structure that provides efficient key lookup operations. It uses a hash function to map a key to an index in an array where the value associated with is stored. The primary challenge in hash tables is dealing with collisions—situations where multiple keys hash to the same index.
Distributed dictionaries, or distributed hash tables (DHTs), are fundamental in managing key-value pairs across a cluster of computers. These structures are crucial in distributed computing frameworks like MapReduce and in applications such as distributed caching. In DHTs, we need to store objects across machines. The key requirements of DHTs are:
- Efficient Key Lookup: Given a key , determine which machine stores its associated value.
- Load Balancing: Ensure each machine holds approximately the same number of items/objects/pairs.
- Scalability: Efficiently handle the addition or removal of machines with minimal item redistribution.
A naive implementation of a DHT is to use a modulo-based method. Here, each key is assigned to a machine based on the modulo operation:
where is the number of machines. This method, while easy to implement, has several critical issues:
- Load Imbalance: The distribution of keys may not be uniform. Some machines may end up with significantly more keys than others, leading to load imbalance and inefficient utilization of resources.
- Scalability Problems: When the number of machines changes (due to addition or removal of machines), the modulo-based assignment requires a complete rehashing of all keys. This means every key potentially needs to be reassigned, causing significant data movement and system disruption.
Consistent Hashing
Consistent hashing, introduced by Karger et al. (1997) is a technique that addresses the challenges of distributing data across a cluster of machines in a way that minimizes disruption when nodes are added or removed. This method ensures that the distributed hash table (DHT) maintains a balanced load and minimizes the movement of data between machines, making it ideal for dynamic distributed systems.
Consistent hashing maps both objects and machines to points on a circular hash space, or hash ring. Consider a hash function that maps both machines and keys to the hash ring. The hash space is a circle of size . A good hash function distributes keys uniformly across the hash space. Cryptographic hash functions (e.g., SHA-256) are commonly used.
Hash Ring Construction:
- Hash each machine to a point on the ring using .
- Hash each key to a point on the ring using .
- A balanced binary search tree or a sorted list is used to store the positions of machines on the hash ring, allowing efficient lookup and updates.
Object Assignment:
- For each key , find the closest machine in the clockwise direction on the ring.
- The machine responsible for is the first machine encountered when moving clockwise from .
To improve load balancing, each physical machine can be assigned multiple virtual nodes. Virtual nodes are additional points on the hash ring for each machine. If a machine has virtual nodes, it is represented by different points on the ring .
Virtual Node Placement:
- Each virtual node is assigned a position on the ring using different hash values.
- This spreads the load more evenly across machines.
Load Balancing:
- With machines and keys, ideally, each machine should store keys.
- Using virtual nodes, each machine's load is closer to this ideal distribution, as it represents multiple points on the ring.
When a machine is added or removed, consistent hashing minimizes the disruption:
Adding a Machine:
- New machine is hashed to a point on the ring.
- Objects that were previously assigned to the machine immediately counterclockwise from are reassigned to .
Removing a Machine:
- Remove the machine's points from the ring.
- Objects that were assigned to the removed machine are reassigned to the next machine in the clockwise direction.
Applications of Consistent Hashing
Consistent hashing is vital in various distributed systems, such as:
Distributed Databases: Used in systems like Apache Cassandra and Amazon DynamoDB to evenly distribute data across nodes and handle dynamic scaling efficiently.
Load Balancing: Employed in load balancers to distribute requests evenly across servers, ensuring no single server becomes a bottleneck.
Content Delivery Networks (CDNs): Utilized to map user requests to the nearest cache servers, optimizing content delivery and minimizing latency.
Peer-to-Peer Networks: Fundamental in DHTs like Chord and Pastry, allowing efficient data lookup and storage across a decentralized network of peers.