Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simple basic explanation of a Distributed Hash Table (DHT)

Tags:

theory

p2p

dht

People also ask

How do distributed hash tables work?

A Distributed Hash Table is a decentralized data store that looks up data based on key-value pairs. Every node in a distributed hash table is responsible for a set of keys and their associated values. The key is a unique identifier for its associated data value, created by running the value through a hashing function.

What is DHT data?

A distributed hash table (DHT) is a decentralized storage system that provides lookup and storage schemes similar to a hash table, storing key-value pairs. Each node in a DHT is responsible for keys along with the mapped values. Any node can efficiently retrieve the value associated with a given key.

What is DHT protocol?

The DHT protocol (Kademlia for BitTorrent) ensures that put/get requests are routed efficiently to the peers responsible for maintaining a given key's IP address lists.

What is DHT P2P?

"DHTs obviated the server from P2P networks. A DHT is a hash table that partitions the keyspace and distributes the parts across a set of nodes. For any new content added to the network, a hash (k) is calculated and a message is sent to any node participating in the DHT.


Ok, they're fundamentally a pretty simple idea. A DHT gives you a dictionary-like interface, but the nodes are distributed across the network. The trick with DHTs is that the node that gets to store a particular key is found by hashing that key, so in effect your hash-table buckets are now independent nodes in a network.

This gives a lot of fault-tolerance and reliability, and possibly some performance benefit, but it also throws up a lot of headaches. For example, what happens when a node leaves the network, by failing or otherwise? And how do you redistribute keys when a node joins so that the load is roughly balanced. Come to think of it, how do you evenly distribute keys anyhow? And when a node joins, how do you avoid rehashing everything? (Remember you'd have to do this in a normal hash table if you increase the number of buckets).

One example DHT that tackles some of these problems is a logical ring of n nodes, each taking responsibility for 1/n of the keyspace. Once you add a node to the network, it finds a place on the ring to sit between two other nodes, and takes responsibility for some of the keys in its sibling nodes. The beauty of this approach is that none of the other nodes in the ring are affected; only the two sibling nodes have to redistribute keys.

For example, say in a three node ring the first node has keys 0-10, the second 11-20 and the third 21-30. If a fourth node comes along and inserts itself between nodes 3 and 0 (remember, they're in a ring), it can take responsibility for say half of 3's keyspace, so now it deals with 26-30 and node 3 deals with 21-25.

There are many other overlay structures such as this that use content-based routing to find the right node on which to store a key. Locating a key in a ring requires searching round the ring one node at a time (unless you keep a local look-up table, problematic in a DHT of thousands of nodes), which is O(n)-hop routing. Other structures - including augmented rings - guarantee O(log n)-hop routing, and some claim to O(1)-hop routing at the cost of more maintenance.

Read the wikipedia page, and if you really want to know in a bit of depth, check out this coursepage at Harvard which has a pretty comprehensive reading list.


DHTs provide the same type of interface to the user as a normal hashtable (look up a value by key), but the data is distributed over an arbitrary number of connected nodes. Wikipedia has a good basic introduction that I would essentially be regurgitating if I write more -

http://en.wikipedia.org/wiki/Distributed_hash_table


I'd like to add onto HenryR's useful answer as I just had an insight into consistent hashing. A normal/naive hash lookup is a function of two variables, one of which is the number of buckets. The beauty of consistent hashing is that we eliminate the number of buckets "n", from the equation.

In naive hashing, first variable is the key of the object to be stored in the table. We'll call the key "x". The second variable is is the number of buckets, "n". So, to determine which bucket/machine the object is stored in, you have to calculate: hash(x) mod(n). Therefore, when you change the number of buckets, you also change the address at which almost every object is stored.

Compare this to consistent hashing. Let's define "R" as the range of a hash function. R is just some constant. In consistent hashing, the address of an object is located at hash(x)/R. Since our lookup is no longer a function of the number of buckets, we end up with less remapping when we change the number of buckets.

http://michaelnielsen.org/blog/consistent-hashing/