Performance-oriented design of a graph-based (key/value) database

Question

I am in the preparation phase of the design a graph-based (or key-value) database library for C++ which many here will find similar to projects such as http://neo4j.org/.

Since this is a very early phase of design, my requirements are simple, unrefined and (I admit) probably still rather naive:

A directed acyclic graph
- A tree-like structure with few roots and many leaves
- Branches may contain references to other branches
- But no cycles
- The graph is represented by key-value pair where keys and values are simple types (integers) for the most part but some may refer to more complex types such as strings
Queries
- Simple queries will usually return edges. I.e. What edges starting from this root, corresponds with (key / value / key-value tupple)?
- Queries using strings of keys (key, key, key, value)
Access patterns and performance
- Fast lookups should be emphasized
- Addition of edges
- But no removal of edges/nodes from the graph. I.e. The graph will grow but it will never shrink.
- Optimizations may be performed on the graph in order to optimize the memory layout for cache usage
- Size of the graph is in the order of around 1 MB - 2 GB and should for the most part fit into primary memory

Given these rough requirements as a challenge, what would your primary concerns be with regard to:

Memory storage: Layout, allocation
- E.g. Pool of fixed size blocks?
- Memory assignment by a clustering algorithm?
Fast queries
Dynamic reorganization
- How to handle addition of edges/nodes?
- Updates for optimization (e.g. improving memory layout)
Concurrent access
- E.g. Handling changes to memory layout by an optimizing thread?

I am looking for good starting points from which to work, hence I'm more than happy to receive references to existing work. Most importantly: What should I be thinking about that I am not thinking about?

Wim Coenen · Accepted Answer

But no cycles

This is a heavy requirement if you need fast edge inserts. Verifying that a new edge does not introduce a cycle is O(v+e) in the worst case (where v is the number of vertices and e is the number of edges). It probably also rules out concurrent insertion of edges. Consider to make this requirement optional.

Another option is to distinguish between two insert operations: CheapInsert and ExpensiveInsert. Let each vertice have a "rank" integer and only allow cheap inserts of edges from lower to higher rank vertices. The expensive insert will not have this constraint and will automatically rewrite the ranks if necessary. Clients can inspect and change the rank of any vertice (as long as the low-to-high rule isn't broken). This way they can implement their own insert strategy, possibly by exploiting some specific properties of the graph to avoid expensive inserts.

hookenz · Answer

A fast key/value database you might be interested in for reference is TinyCDB http://www.corpit.ru/mjt/tinycdb.html

Performance-oriented design of a graph-based (key/value) database

Tags:

c++

performance

key-value

graph-databases

Rehno Lindeque

2 Answers

Wim Coenen

hookenz

Recent Activity

Donate For Us

Performance-oriented design of a graph-based (key/value) database

Tags:

c++

performance

key-value

graph-databases

Rehno Lindeque

2 Answers

Wim Coenen

hookenz

Related questions

Recent Activity

Donate For Us