Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Performance-oriented design of a graph-based (key/value) database

I am in the preparation phase of the design a graph-based (or key-value) database library for C++ which many here will find similar to projects such as http://neo4j.org/.

Since this is a very early phase of design, my requirements are simple, unrefined and (I admit) probably still rather naive:

  • A directed acyclic graph
    • A tree-like structure with few roots and many leaves
    • Branches may contain references to other branches
    • But no cycles
    • The graph is represented by key-value pair where keys and values are simple types (integers) for the most part but some may refer to more complex types such as strings
  • Queries
    • Simple queries will usually return edges. I.e. What edges starting from this root, corresponds with (key / value / key-value tupple)?
    • Queries using strings of keys (key, key, key, value)
  • Access patterns and performance
    • Fast lookups should be emphasized
    • Addition of edges
    • But no removal of edges/nodes from the graph. I.e. The graph will grow but it will never shrink.
    • Optimizations may be performed on the graph in order to optimize the memory layout for cache usage
    • Size of the graph is in the order of around 1 MB - 2 GB and should for the most part fit into primary memory

Given these rough requirements as a challenge, what would your primary concerns be with regard to:

  • Memory storage: Layout, allocation
    • E.g. Pool of fixed size blocks?
    • Memory assignment by a clustering algorithm?
  • Fast queries
  • Dynamic reorganization
    • How to handle addition of edges/nodes?
    • Updates for optimization (e.g. improving memory layout)
  • Concurrent access
    • E.g. Handling changes to memory layout by an optimizing thread?

I am looking for good starting points from which to work, hence I'm more than happy to receive references to existing work. Most importantly: What should I be thinking about that I am not thinking about?

like image 571
Rehno Lindeque Avatar asked Feb 14 '26 19:02

Rehno Lindeque


2 Answers

But no cycles

This is a heavy requirement if you need fast edge inserts. Verifying that a new edge does not introduce a cycle is O(v+e) in the worst case (where v is the number of vertices and e is the number of edges). It probably also rules out concurrent insertion of edges. Consider to make this requirement optional.

Another option is to distinguish between two insert operations: CheapInsert and ExpensiveInsert. Let each vertice have a "rank" integer and only allow cheap inserts of edges from lower to higher rank vertices. The expensive insert will not have this constraint and will automatically rewrite the ranks if necessary. Clients can inspect and change the rank of any vertice (as long as the low-to-high rule isn't broken). This way they can implement their own insert strategy, possibly by exploiting some specific properties of the graph to avoid expensive inserts.

like image 138
Wim Coenen Avatar answered Feb 17 '26 11:02

Wim Coenen


A fast key/value database you might be interested in for reference is TinyCDB http://www.corpit.ru/mjt/tinycdb.html

like image 29
hookenz Avatar answered Feb 17 '26 11:02

hookenz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!