 

Graph databases utilizing locality

DAG = directed acyclic graph; roots = vertices without incoming edges.

I have a DAG larger than available RAM, so I need a disk-based graph database to work with it.

My DAG is shallow: I have billions of root nodes, but from each node only a few dozen nodes are reachable.

It is also not well connected: the majority of nodes have only one incoming edge, so for any pair of root nodes the reachable subgraphs usually have very few nodes in common.

So my DAG can be thought of as a large number of small trees, only a few of which intersect.

I need to perform the following query on my DAG in bulk: given a root node, get all nodes reachable from it.

It can be thought of as a batch query: given a few thousand root nodes, return all nodes reachable from them.
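To make the workload concrete, here is a minimal in-memory sketch of that batch query as a multi-source BFS. Everything here (the edge list, the `reachable_from` name) is hypothetical; on a disk-based store the traversal would be driven by the database's adjacency lookups rather than a Python dict, but the access pattern is the same.

```python
from collections import defaultdict, deque

def reachable_from(edges, roots):
    """Return every node reachable from any of the given root nodes."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Two small trees sharing a single node, as in the description above.
edges = [(1, 2), (2, 3), (4, 5), (5, 3)]
print(reachable_from(edges, {1, 4}))  # {1, 2, 3, 4, 5}
```

Note that each edge is touched at most once per batch, so the cost is dominated by how the edges are laid out on disk, which is exactly where locality matters.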

As far as I know there are algorithms to improve disk storage locality for graphs. Three examples are:

  • http://ceur-ws.org/Vol-733/paper_pacher.pdf
  • http://www.cs.ox.ac.uk/dan.olteanu/papers/g-store.pdf
  • http://graphlab.org/files/osdi2012-kyrola-blelloch-guestrin.pdf

It also seems there are older-generation graph databases that don't utilize graph locality. For example, the popular Neo4j graph database:

http://www.ibm.com/developerworks/library/os-giraph/

Neo4j relies on data access methods for graphs without considering data locality, and the processing of graphs entails mostly random data access. For large graphs that cannot be stored in memory, random disk access becomes a performance bottleneck.

My question is: are there any graph databases suited well for my workload?

Support for Win64 and the possibility to work with the database from something other than Java are a plus.

asked Nov 11 '22 by nponeccop

1 Answer

From the task description it doesn't seem that you need a graph database at all. You can simply use an external-memory programming library such as stxxl. First perform a topological sort on the graph (stored in edge format); then a single sequential scan visits everything reachable from the root nodes, so the I/O complexity is bounded by that of the topological sort. In fact, you don't even need a full topological sort: you only need to identify the root nodes, which can be done with a join between the edge table and the node table in linear time.
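The root-finding step above can be sketched as follows. This is a hypothetical in-memory illustration (the `find_roots` name and the tables are made up): a root is any node that never appears as an edge destination, so one linear pass over the edge table, effectively an anti-join of the node table against the destination column, finds them all. With stxxl the same idea would run over external-memory vectors instead of Python lists.

```python
def find_roots(nodes, edges):
    """Return nodes with no incoming edge, i.e. the roots of the DAG."""
    has_incoming = {dst for _, dst in edges}  # destination column of the edge table
    return [n for n in nodes if n not in has_incoming]

nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5), (5, 3)]
print(find_roots(nodes, edges))  # [1, 4]
```

Once the roots are known, a sequential scan of the topologically sorted edge list emits each reachable subgraph without random disk access.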

answered Nov 15 '22 by lgylym