Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why would index nodes or an indexed property be better in a graph database?

I'm just getting into graph databases, and I seem to keep running into a problem deciding between using an "index node" or an "indexed property" for tracking things like "node type". Since I've no real experience thus far, I don't have any information to base the decision on and both approaches seem to be equally valid.

So, the question is: What are the tradeoffs between two approaches, and how does scale (ie. number of nodes) affect the decision?

For a sample scenario, lets assume there are two types of "things": User and Product, and the edges between the User nodes and the Product nodes don't matter so much, but what we care about is if we want type: User and type: Product properties on each node, or if we want each node to have an edge pointing back at a User node and a Product node, respectively.

Which approach is better under which circumstances?

Note: I'm looking at Neo4j and Titan in particular, but I would think that this will tend to apply more generally as well.

like image 972
cdeszaq Avatar asked Oct 05 '12 22:10

cdeszaq


People also ask

Why would you create an index using Neo4j?

In neo4j you can create index for both property and nodes. Indexing is data structure that helps faster performance on retrieval operation on database. There is special features in neo4j indexing once you create indexing that index will manage itself and keep it up to date whenever changes made on the database.

What is the purpose of using indexing?

Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

What is the benefit of index-free adjacency?

At write time, index-free adjacency speeds up processing by ensuring that each node is stored directly to its adjacent nodes and relationships. Then, during query processing (i.e., read time), index-free adjacency ensures lightning-fast retrieval without a heavy reliance on indexes.

What statement best describes index-free adjacency for a Neo4j database?

This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n).". This statement explains the O(n) results before indexing.


1 Answers

First, you need to ask yourself: Does the type of a vertex/node need to be indexed? I.e. do you need to retrieve vertices/nodes by their type, let's say, retrieve all 'user' vertices from the graph or do you need to answer queries that start by retrieving all vertices of a given type and then filter/process those further?

If the answer to this question is yes, then I suggest you store the type as a string property that is indexed. Or, if you are developing in a jvm based language, you could define a type enum and use that as the property type for more type safety and automatic error checking. Titan supports arbitrary user defined classes/enums as property types and will compress those for a low memory footprint.

However, the downside of this approach is that this won't scale because you are building a low selectivity index. What that means is that there will likely be very many vertices of type 'user' or 'product' and all those need to be associated with the index entry for 'user' or 'product' respectively. This makes maintaining and querying this index very expensive and hard to scale (imagine facebook had a 'type' index: the 'photo' entry would have billions of vertices under it). If you are not (yet) concerned with scaling, then this can work.

If the answer to the question is no, then I suggest to model types as vertices/nodes in the graph. I.e. have a 'user' vertex and a 'product' vertex and an edge labeled 'type' from each user to the 'user' vertex, etc.

The advantage of this approach is that you use the graph to model your data rather than having string values outside of your database represent crucial type information. As you build your application, the graph database will become its central component and last for a long time. As programming languages and developers come and go, you don't want data modeling and type information to go with them and be faced with the question: "What does SPECIAL_USER mean?" Rather, have a SPECIAL_USER vertex and add provenance information to it, i.e., who created this type, what does it represent and a short description - all in the database.

One problem with this approach is that the 'user' and 'product' vertices will have a lot of edges incident on them as your application scales. In other words, you are creating supernodes which create scaling issues. This is why Titan introduced the concept of a unidirectional edge. A unidirectional edge is like a link on the web: the starting vertex points to another vertex, but that vertex is unaware of the edge. Since you don't want to traverse from the 'user' vertex to all user vertices, you aren't loosing anything but gaining in scalability and performance.

like image 88
Matthias Broecheler Avatar answered Oct 26 '22 19:10

Matthias Broecheler