Consistent hashing

Theory

Distributed Hash Tables (DHTs)

A hash table is a fundamental data structure that provides efficient key lookup operations. It uses a hash function h(k)h(k) to map a key kk to an index in an array where the value associated with kk is stored. The primary challenge in hash tables is dealing with collisions—situations where multiple keys hash to the same index.

Distributed dictionaries, or distributed hash tables (DHTs), are fundamental in managing key-value pairs across a cluster of computers. These structures are crucial in distributed computing frameworks like MapReduce and in applications such as distributed caching. In DHTs, we need to store mm objects across nn machines. The key requirements of DHTs are:

  1. Efficient Key Lookup: Given a key kk, determine which machine stores its associated value.
  2. Load Balancing: Ensure each machine holds approximately the same number of items/objects/pairs.
  3. Scalability: Efficiently handle the addition or removal of machines with minimal item redistribution.

A naive implementation of a DHT is to use a modulo-based method. Here, each key kk is assigned to a machine based on the modulo operation:

Machine=h(k)modn \text{Machine} = h(k) \mod n

where nn is the number of machines. This method, while easy to implement, has several critical issues:

  1. Load Imbalance: The distribution of keys may not be uniform. Some machines may end up with significantly more keys than others, leading to load imbalance and inefficient utilization of resources.
  2. Scalability Problems: When the number of machines nn changes (due to addition or removal of machines), the modulo-based assignment requires a complete rehashing of all keys. This means every key potentially needs to be reassigned, causing significant data movement and system disruption.

Consistent Hashing

Consistent hashing, introduced by Karger et al. (1997) is a technique that addresses the challenges of distributing data across a cluster of machines in a way that minimizes disruption when nodes are added or removed. This method ensures that the distributed hash table (DHT) maintains a balanced load and minimizes the movement of data between machines, making it ideal for dynamic distributed systems.


Consistent hashing maps both objects and machines to points on a circular hash space, or hash ring. Consider a hash function hh that maps both machines and keys to the hash ring. The hash space is a circle of size 2m2^m. A good hash function distributes keys uniformly across the hash space. Cryptographic hash functions (e.g., SHA-256) are commonly used.

  • Hash Ring Construction:

    • Hash each machine MiM_i to a point on the ring using h(Mi)h(M_i).
    • Hash each key kk to a point on the ring using h(k)h(k).
    • A balanced binary search tree or a sorted list is used to store the positions of machines on the hash ring, allowing efficient lookup and updates.
  • Object Assignment:

    • For each key kk, find the closest machine MiM_i in the clockwise direction on the ring.
    • The machine responsible for kk is the first machine encountered when moving clockwise from h(k)h(k).

To improve load balancing, each physical machine can be assigned multiple virtual nodes. Virtual nodes are additional points on the hash ring for each machine. If a machine MiM_i has vv virtual nodes, it is represented by vv different points on the ring {h(Mi1),h(Mi2),,h(Miv)}\{h(M_i^1), h(M_i^2), \ldots, h(M_i^v)\}.

  • Virtual Node Placement:

    • Each virtual node is assigned a position on the ring using different hash values.
    • This spreads the load more evenly across machines.
  • Load Balancing:

    • With nn machines and mm keys, ideally, each machine should store mn\frac{m}{n} keys.
    • Using virtual nodes, each machine's load is closer to this ideal distribution, as it represents multiple points on the ring.

When a machine is added or removed, consistent hashing minimizes the disruption:

  1. Adding a Machine:

    • New machine MnewM_{new} is hashed to a point on the ring.
    • Objects that were previously assigned to the machine immediately counterclockwise from h(Mnew)h(M_{new}) are reassigned to MnewM_{new}.
  2. Removing a Machine:

    • Remove the machine's points from the ring.
    • Objects that were assigned to the removed machine are reassigned to the next machine in the clockwise direction.

Applications of Consistent Hashing

Consistent hashing is vital in various distributed systems, such as:

  1. Distributed Databases: Used in systems like Apache Cassandra and Amazon DynamoDB to evenly distribute data across nodes and handle dynamic scaling efficiently.

  2. Load Balancing: Employed in load balancers to distribute requests evenly across servers, ensuring no single server becomes a bottleneck.

  3. Content Delivery Networks (CDNs): Utilized to map user requests to the nearest cache servers, optimizing content delivery and minimizing latency.

  4. Peer-to-Peer Networks: Fundamental in DHTs like Chord and Pastry, allowing efficient data lookup and storage across a decentralized network of peers.