Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does Titan achieve constant time lookup using HBase / Cassandra?

In the O'Reilly book "Graph Databases" in chapter 6, which is about how Neo4j stores a graph database it says:

To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.

It is then explained that Neo4j achieves this constant time lookup by storing all nodes and relationships as fixed size records:

With fixed sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we’d have to do if faking a graph in a non-graph native database)

This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?

like image 441
Lodewijk Bogaards Avatar asked Sep 24 '14 05:09

Lodewijk Bogaards


1 Answers

Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).

Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j cause it has a better disk representation.

Please see the following blog post that explains the above quantitatively: http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/

Thus, its important to understand when people say O(1) what part of the memory hierarchy they are in. When you are in a single JVM (single machine), its easy to be fast as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.

Please see the following two blog posts for more information:

http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/ http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/

like image 138
Marko A. Rodriguez Avatar answered Jan 17 '23 19:01

Marko A. Rodriguez