
Partitioning a weighted directed graph (over key/value database)

We would like to shard a weighted directed graph.

The user can add nodes and edges dynamically; initially the DB/graph is empty.

We keep the nodes and edges in a key/value database (probably Redis): for each node, the key is the nodeId and the value is a sorted set of the nodeIds it references, where the score of each nodeId in the sorted set is the weight of the edge.

(See question regarding that here: Redis: Implement Weighted Directed Graph)

We don't have a balance constraint, the most common operation over the graph is Dijkstra, and we'd like to minimize the I/O (network traffic, in our case).

Possible solution: each DB server contains a list of the other servers' IPs:

key:server1, value:....250.1

key:server2, value:....250.2

key:server3, value:....250.3

and each nodeId will be of the form serverX.originalNodeId.
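A minimal sketch of this addressing scheme, assuming Python; the `SERVERS` dict and `resolve` helper are illustrative names, not part of the question, and the IPs stay truncated as written above:

```python
# Server registry as described: key -> IP (IPs truncated as in the question).
SERVERS = {
    "server1": "....250.1",
    "server2": "....250.2",
    "server3": "....250.3",
}

def resolve(node_id):
    """Split a 'serverX.originalNodeId' key into (server IP, original id)."""
    server, _, original = node_id.partition(".")
    return SERVERS[server], original
```

The client would then open a connection to the returned IP and fetch the sorted set under `originalNodeId` (or the full prefixed key, depending on convention).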

What would be the algorithm that decides which node goes where? Should we support re-positioning of a node?

I guess the naive approach would be: add node A to the serverX that maximizes the number of nodes on serverX that have edges with node A, as long as serverX is not fully occupied.
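The naive placement rule above can be sketched as follows; the function and parameter names are hypothetical, and capacity is modeled as a simple per-server node count:

```python
from collections import Counter

def pick_server(neighbor_servers, occupancy, capacity):
    """Naive placement: put the new node on the server that already holds
    the most of its neighbors, skipping servers that are full.

    neighbor_servers: list of server names, one per existing neighbor of A
    occupancy: {server: current node count}
    capacity: max nodes per server
    """
    counts = Counter(neighbor_servers)
    candidates = [s for s in occupancy if occupancy[s] < capacity]
    if not candidates:
        raise RuntimeError("all servers are full")
    # max() keeps the first maximum it sees, so sorting the candidates
    # first makes ties resolve deterministically by server name.
    return max(sorted(candidates), key=lambda s: counts[s])
```

Note that this greedy rule is sensitive to insertion order and says nothing about re-positioning; nodes placed early may end up far from neighbors added later.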

asked Nov 05 '22 19:11 by DuduAlul

1 Answer

Since processing happens client side, this sort of graph data isn't too difficult to shard - all you need at each step is a single sorted set, so it doesn't matter which server that set is loaded from. Getting the actual data to go with the graph node happens as the final step - that will be a simple MGET if you have only one server, and is fairly easy to split across several.
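To make the "one sorted set per step" point concrete, here is a sketch of client-side Dijkstra over sharded adjacency data. The shards are simulated as in-memory dicts (`SHARDS`, `neighbors`, and the sample graph are all illustrative); against real Redis, `neighbors` would be a `ZRANGE key 0 -1 WITHSCORES` on whichever server the prefix points at:

```python
import heapq

# Hypothetical stand-in for per-server data: shard -> {node: {neighbor: weight}}.
# Each inner dict plays the role of one Redis sorted set.
SHARDS = {
    "server1": {
        "server1.a": {"server2.b": 1.0, "server1.c": 4.0},
        "server1.c": {},
    },
    "server2": {
        "server2.b": {"server1.c": 1.0},
    },
}

def neighbors(node_id):
    """One round trip per step: fetch the node's sorted set from its shard."""
    shard = node_id.split(".", 1)[0]
    return SHARDS[shard].get(node_id, {})

def dijkstra(source):
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:          # stale heap entry, already improved
            continue
        for v, w in neighbors(u).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Each iteration touches exactly one shard, so the network cost is one round trip per settled node regardless of how the nodes are distributed.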

To determine which server a key will be stored on, you should use a hash rather than trying to track keys manually. I use a table mapping a range of hashes to a particular server. It is stored in Redis for long-term persistence but is really part of the client. To access a particular key you just get the hash of the key, look it up in the table, and connect to that server. Using a table with thousands of slots makes it easy to move data to another server - update the table and requests for a particular slot will go to a different server. This is fairly similar to, though not exactly the same as, the approach used in Redis Cluster.
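A minimal sketch of such a slot table, assuming CRC32 as the hash and a round-robin initial assignment (the slot count, hash choice, and function names are all assumptions, not from the answer):

```python
import zlib

NUM_SLOTS = 4096  # "thousands of slots", per the answer

# slot -> server name; the answer persists this table in Redis,
# but the lookup itself happens client side from a cached copy.
slot_table = {slot: f"server{(slot % 3) + 1}" for slot in range(NUM_SLOTS)}

def server_for(key):
    """Hash the key, map the hash to a slot, and look up the slot's server."""
    slot = zlib.crc32(key.encode()) % NUM_SLOTS
    return slot_table[slot]

def migrate_slot(slot, new_server):
    """Rebalancing: after copying the slot's keys, repoint the slot."""
    slot_table[slot] = new_server
```

Moving data is then a per-slot operation: copy the keys whose hash falls in the slot, update the table entry, and subsequent lookups go to the new server. (Redis Cluster uses a similar scheme with 16384 slots and CRC16.)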

That said, my reason for setting up sharding was not graph data. Small sorted sets containing only IDs don't take up much memory - you should be able to handle 100 million edges on a single server without too much trouble.

answered Nov 09 '22 13:11 by Tom Clarkson