Consistent hashing as a way to scale writes

I am trying to figure out if I am on the right track. I am building a (real-time) statistics/analytics service, and I use Redis to store some sets and hashes.

Now let's assume I have some success and I need to scale out. The hash ring technique looks nice, but I have the impression that it is only suited for caching scenarios.

What if a node goes down? In theory, its keys are now owned by other nodes. In practice, they won't have the data. It is lost, right? The same applies to adding/removing nodes.

Am I missing some fundamental thing? Can this be a poor man's cluster?

asked Apr 12 '11 at 21:04 by Sergio Tulentsev

People also ask

What is consistent hashing used for?

Consistent hashing is used in distributed systems to keep the mapping from keys to servers largely independent of the number of servers, so that only a minimal number of keys need to be relocated when the system scales up or down.

How does consistent hashing work, and how would you apply and scale it in a system?

Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.
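
To make the ring concrete, here is a minimal Python sketch (the node names and the number of virtual points are illustrative, not from any particular library):

    import bisect
    import hashlib

    # Minimal consistent hash ring. Each node is placed on the ring many
    # times ("virtual nodes") so keys spread evenly across nodes.
    class HashRing:
        def __init__(self, nodes, vnodes=100):
            self.vnodes = vnodes
            self.ring = []    # sorted ring positions
            self.owner = {}   # position -> node name
            for node in nodes:
                self.add(node)

        def _hash(self, value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def add(self, node):
            for i in range(self.vnodes):
                pos = self._hash("%s:%d" % (node, i))
                bisect.insort(self.ring, pos)
                self.owner[pos] = node

        def remove(self, node):
            for i in range(self.vnodes):
                pos = self._hash("%s:%d" % (node, i))
                self.ring.remove(pos)
                del self.owner[pos]

        def node_for(self, key):
            # Walk clockwise to the first position at or after the key's hash.
            idx = bisect.bisect(self.ring, self._hash(key)) % len(self.ring)
            return self.owner[self.ring[idx]]

    ring = HashRing(["redis-a", "redis-b", "redis-c"])
    print(ring.node_for("user:42:pageviews"))
    ring.remove("redis-b")  # only keys that were on redis-b move elsewhere
    print(ring.node_for("user:42:pageviews"))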

How can hash functions help distribute writes over multiple servers?

Distributed hashing: the hash table is spread across multiple servers to avoid the memory limitations of a single server. Objects (and their keys) are distributed among several servers. This kind of setup is very common for in-memory caches like Memcached and Redis.

How does consistent hashing work in Cassandra?

Consistent hashing allows distribution of data across a cluster to minimize reorganization when nodes are added or removed. In Cassandra, consistent hashing partitions data based on the partition key.


1 Answer

There are two reasons to use multiple nodes in a cluster:

  • Sharding, to limit the amount of data stored on each node
  • Replication, to reduce read load and to allow a node to be removed without data loss

The two are fundamentally different, but you can implement both: use consistent hashing to point to a set of nodes with a standard master/slave setup, rather than to a single node.
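
As a rough sketch of that combination (Python, with hypothetical shard names and addresses), the ring can resolve to a master/slave pair instead of a single server:

    import bisect
    import hashlib

    # Consistent hashing that resolves to a shard (master + slave) rather
    # than a single server. Shard names and addresses are hypothetical.
    SHARDS = {
        "shard-1": {"master": "10.0.0.1:6379", "slave": "10.0.0.2:6379"},
        "shard-2": {"master": "10.0.1.1:6379", "slave": "10.0.1.2:6379"},
        "shard-3": {"master": "10.0.2.1:6379", "slave": "10.0.2.2:6379"},
    }

    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    # Build the ring once, with many virtual points per shard.
    RING = sorted((_hash("%s:%d" % (name, i)), name)
                  for name in SHARDS for i in range(100))

    def shard_for(key):
        idx = bisect.bisect(RING, (_hash(key), "")) % len(RING)
        return SHARDS[RING[idx][1]]

    shard = shard_for("user:42:pageviews")
    # Writes always go to the shard's master; the slave protects against
    # data loss and can also serve reads.
    print("write to", shard["master"], "read from", shard["slave"])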

If the cluster is your primary data store rather than a cache, you will need a different redistribution strategy that includes copying the data.

My implementation is based on having the client hash each key to one of 64k buckets, with a table that maps each bucket to a node. Initially, all buckets map to node #1.
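
A minimal sketch of that scheme in Python (the node names, the CRC32 choice, and where the table lives are illustrative assumptions, not the actual implementation):

    import zlib

    BUCKETS = 65536  # 64k fixed buckets

    # Initially every bucket maps to node #1. In practice the table would
    # live somewhere all clients can read it (config, or Redis itself).
    bucket_to_node = {b: "node-1" for b in range(BUCKETS)}

    def bucket_for(key):
        # Stable hash of the key into one of the 64k buckets.
        return zlib.crc32(key.encode()) % BUCKETS

    def node_for(key):
        return bucket_to_node[bucket_for(key)]

    print(node_for("user:42:pageviews"))  # "node-1" before any split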

When node #1 gets too large, its slave is promoted to become master of node #2, and the table is updated to map half of the buckets owned by node #1 to node #2. At that point all reads and writes work with the new mapping, and you just need to clean up the keys that are now on the wrong node. Depending on the performance requirements, you can check all keys at once, or check a random selection of keys the way the expiry system does.
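
A rough sketch of that split and cleanup, under the same illustrative assumptions as above:

    import zlib

    BUCKETS = 65536
    bucket_to_node = {b: "node-1" for b in range(BUCKETS)}

    def bucket_for(key):
        return zlib.crc32(key.encode()) % BUCKETS

    def split(old, new):
        # The slave of `old` has just been promoted to master as `new`, so
        # both nodes hold a full copy of the data; only the table changes.
        owned = [b for b, n in bucket_to_node.items() if n == old]
        for b in owned[len(owned) // 2:]:
            bucket_to_node[b] = new

    split("node-1", "node-2")

    def is_misplaced(key, node_name):
        # The cleanup pass deletes keys whose bucket now maps elsewhere;
        # it can scan everything at once or sample random keys, the way
        # Redis key expiry does.
        return bucket_to_node[bucket_for(key)] != node_name

    print(is_misplaced("user:42:pageviews", "node-1"))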

answered Sep 21 '22 at 01:09 by Tom Clarkson