
Efficient way to manipulate a large (doesn't fit into memory) graph [closed]

I need to maintain a large directed graph G, with possibly millions of nodes and edges. It may not fit in memory.

Some frequent operations I need to perform on this graph include:

  1. Each node/edge will have user-defined properties associated with it, such as access count, weight, etc.

  2. For each node (vertex), I will need to perform efficient queries based on the property values. For example, find the nodes whose X value is greater than v1 but less than v2. This probably requires building an index on certain fields.

  3. I will need to find all incoming edges and outgoing edges of a given node, and update the weights of those edges.

  4. I will need to do local (DFS-based) traversal from a given node, and return all paths which satisfy a certain user-defined predicate (this predicate may use the property values of the nodes/edges in a path).

  5. I will need to add/delete nodes/edges efficiently. This is not performed as often as operations 1, 2, and 3, though.

There are potentially some hot-spots in the graph which get accessed much more often than the other parts, and I would like to cache these hot-spots in memory.
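
To make this concrete, here is a rough sketch of the kind of C/C++ interface I have in mind. All names are hypothetical and only illustrate operations 1-5 above; I am not assuming any particular library:

    // Hypothetical interface sketch -- declarations only, just to show the
    // shape of the five operations. None of this names a real library.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    using NodeId = std::uint64_t;
    using EdgeId = std::uint64_t;

    struct NodeProps { std::uint64_t access_count; double weight; /* other user-defined fields */ };
    struct EdgeProps { double weight; /* other user-defined fields */ };
    struct Edge      { EdgeId id; NodeId from; NodeId to; EdgeProps props; };

    class DiskGraph {
    public:
        // 1. Attach/update user-defined properties on nodes and edges.
        void set_node_props(NodeId n, const NodeProps& p);
        void set_edge_props(EdgeId e, const EdgeProps& p);

        // 2. Range query over an indexed property (v1 < X < v2).
        std::vector<NodeId> nodes_with_weight_between(double v1, double v2);

        // 3. Incident edges of a node, so their weights can be updated.
        std::vector<Edge> in_edges(NodeId n);
        std::vector<Edge> out_edges(NodeId n);

        // 4. Bounded DFS from a node, returning paths accepted by a predicate
        //    that may look at node/edge properties along the path.
        std::vector<std::vector<Edge>> dfs_paths(
            NodeId start, std::size_t max_depth,
            const std::function<bool(const std::vector<Edge>&)>& accept);

        // 5. Structural updates (less frequent than 1-3).
        NodeId add_node(const NodeProps& p);
        EdgeId add_edge(NodeId from, NodeId to, const EdgeProps& p);
        void remove_node(NodeId n);
        void remove_edge(EdgeId e);
    };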

What is an efficient way to achieve this with the least implementation effort?

I am looking at some disk-based graph databases, such as Neo4j/InfiniteGraph/DEX. Even though they support all the above operations, they seem to be overkill, since I don't need many of the features they offer, such as consistency/concurrency control or cluster-based replication. Also, many of them are Java-based, and I would prefer something with a C/C++ interface.

Basically, I just need an on-disk graph library which handles persistence, node queries, and local traversal efficiently. Do you have any recommendations for an existing (open source) project I could use? If not, what is the best way to implement such a thing?

asked Jun 06 '13 by yangsuli


1 Answer

I have seen some large graphs with millions upon millions of nodes. What I recommend is that, past a certain point, you do a weighted compression: take N nodes, compress them into N/M nodes using averages and weights, and then rebuild the graph.

You would recompress after every so many nodes, of your choosing. The reason is that as everything gets huge, you will be able to, in a sense, normalize it over time.
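
To illustrate what one such compression pass could look like, here is a minimal C++ sketch. The names are made up, and summing/averaging is just one possible merge rule, not a prescription; you would pick the grouping however you like (every M nodes, clusters, hot-spots, etc.) and rerun it whenever the graph grows past your threshold:

    // Sketch of one compression pass: collapse groups of nodes into
    // supernodes. "group_of" maps each original node to its supernode
    // (e.g. one group per M nodes, or per cluster -- your choice).
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using NodeId = std::uint64_t;

    struct Node  { NodeId id; double weight; };
    struct Edge  { NodeId from; NodeId to; double weight; };
    struct Graph { std::vector<Node> nodes; std::vector<Edge> edges; };

    Graph compress(const Graph& g,
                   const std::unordered_map<NodeId, NodeId>& group_of) {
        Graph out;

        // Supernode weight = average of its members' weights.
        std::map<NodeId, std::pair<double, std::size_t>> acc;  // sum, count
        for (const Node& n : g.nodes) {
            auto& a = acc[group_of.at(n.id)];
            a.first  += n.weight;
            a.second += 1;
        }
        for (const auto& [gid, a] : acc)
            out.nodes.push_back({gid, a.first / static_cast<double>(a.second)});

        // Parallel edges between the same pair of supernodes are merged;
        // here their weights are summed, and intra-group edges are dropped.
        std::map<std::pair<NodeId, NodeId>, double> merged;
        for (const Edge& e : g.edges) {
            NodeId u = group_of.at(e.from), v = group_of.at(e.to);
            if (u != v) merged[{u, v}] += e.weight;
        }
        for (const auto& [uv, w] : merged)
            out.edges.push_back({uv.first, uv.second, w});

        return out;
    }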

You have a directed graph. As you compress larger and larger groups of nodes, if A>B>(E&D)>H, you can then say things like: A>H.

This lets you determine common routes between nodes, based on the shortest jumps between them. If something isn't in a compressed list, a query will at least head towards the right area, and that region can, depending on the situation, be decompressed in some sense.
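
For example, recording those A>H shortcuts could look roughly like this (again, just a sketch of the idea with made-up names; how many hops you summarize is up to you):

    // Sketch: record a direct "A > H" shortcut for every node H reachable
    // from A within max_hops, using a bounded breadth-first walk.
    #include <cstddef>
    #include <cstdint>
    #include <queue>
    #include <set>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using NodeId = std::uint64_t;
    using Adjacency = std::unordered_map<NodeId, std::vector<NodeId>>;  // node -> successors

    std::vector<std::pair<NodeId, NodeId>> shortcuts_from(
            const Adjacency& adj, NodeId start, std::size_t max_hops) {
        std::vector<std::pair<NodeId, NodeId>> result;
        std::set<NodeId> seen{start};
        std::queue<std::pair<NodeId, std::size_t>> frontier;
        frontier.push({start, 0});

        while (!frontier.empty()) {
            auto [node, depth] = frontier.front();
            frontier.pop();
            if (depth == max_hops) continue;
            auto it = adj.find(node);
            if (it == adj.end()) continue;
            for (NodeId next : it->second) {
                if (seen.insert(next).second) {
                    result.push_back({start, next});  // shortcut: start > next
                    frontier.push({next, depth + 1});
                }
            }
        }
        return result;
    }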

answered Nov 10 '22 by Fallenreaper