Why would index nodes or an indexed property be better in a graph database?

Tags:

database-design

graph-databases

neo4j

titan

I'm just getting into graph databases, and I seem to keep running into a problem deciding between using an "index node" or an "indexed property" for tracking things like "node type". Since I've no real experience thus far, I don't have any information to base the decision on and both approaches seem to be equally valid.

So, the question is: What are the tradeoffs between the two approaches, and how does scale (i.e. number of nodes) affect the decision?

For a sample scenario, let's assume there are two types of "things": User and Product. The edges between the User nodes and the Product nodes don't matter so much; what we care about is whether we want type: User and type: Product properties on each node, or whether we want each node to have an edge pointing back at a User node or a Product node, respectively.

Which approach is better under which circumstances?

Note: I'm looking at Neo4j and Titan in particular, but I would think that this will tend to apply more generally as well.

asked Oct 05 '12 22:10

cdeszaq

1 Answer

First, you need to ask yourself: Does the type of a vertex/node need to be indexed? That is, do you need to retrieve vertices/nodes by their type (say, retrieve all 'user' vertices from the graph), or do you need to answer queries that start by retrieving all vertices of a given type and then filter/process those further?

If the answer to this question is yes, then I suggest you store the type as a string property that is indexed. Or, if you are developing in a JVM-based language, you could define a type enum and use that as the property type for more type safety and automatic error checking. Titan supports arbitrary user-defined classes/enums as property types and will compress those for a low memory footprint.

However, the downside of this approach is that it won't scale, because you are building a low-selectivity index. What that means is that there will likely be very many vertices of type 'user' or 'product', and all of those need to be associated with the index entry for 'user' or 'product' respectively. This makes maintaining and querying this index very expensive and hard to scale (imagine if Facebook had a 'type' index: the 'photo' entry would have billions of vertices under it). If you are not (yet) concerned with scaling, then this can work.
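To make the selectivity point concrete, here is a minimal Python sketch of a toy in-memory inverted index. It is not how Titan or Neo4j store indexes; the `PropertyIndex` class, the `selectivity` helper, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


class PropertyIndex:
    """Toy inverted index: property value -> set of vertex ids (illustrative only)."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, value, vertex_id):
        self.entries[value].add(vertex_id)

    def lookup(self, value):
        # .get avoids creating an empty entry for unknown values
        return self.entries.get(value, set())


def selectivity(index):
    """Distinct indexed values divided by indexed vertices (1.0 = every value unique)."""
    total = sum(len(ids) for ids in index.entries.values())
    return len(index.entries) / total if total else 0.0


type_index = PropertyIndex()   # low selectivity: only two distinct values
name_index = PropertyIndex()   # high selectivity: one value per vertex

for vid in range(1000):
    vtype = "user" if vid % 2 == 0 else "product"
    type_index.add(vtype, vid)
    name_index.add(f"thing-{vid}", vid)

# Every 'user' vertex hangs off a single index entry, so that entry
# grows linearly with the number of users.
users = type_index.lookup("user")
```

In this toy setup the 'user' entry holds half of all vertices, while each entry of the name index holds exactly one. That imbalance is what the "billions of vertices under one entry" example describes.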

If the answer to the question is no, then I suggest modeling types as vertices/nodes in the graph. That is, have a 'user' vertex and a 'product' vertex, and an edge labeled 'type' from each user to the 'user' vertex, etc.

The advantage of this approach is that you use the graph to model your data rather than having string values outside of your database represent crucial type information. As you build your application, the graph database will become its central component and last for a long time. As programming languages and developers come and go, you don't want data modeling and type information to go with them, leaving you with the question: "What does SPECIAL_USER mean?" Rather, have a SPECIAL_USER vertex and add provenance information to it, i.e., who created this type, what it represents, and a short description - all in the database.
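As a rough illustration of a type vertex that carries its own provenance, here is a small Python sketch using a toy in-memory graph. The `Graph` class, its methods, and the property values (`created_by`, `description`, the names) are hypothetical placeholders, not part of any Titan or Neo4j API.

```python
class Graph:
    """Toy property graph storing outgoing edges per vertex (illustrative only)."""

    def __init__(self):
        self.props = {}      # vertex id -> property dict
        self.out_edges = {}  # vertex id -> list of (label, target id)
        self._next_id = 0

    def add_vertex(self, **props):
        vid = self._next_id
        self._next_id += 1
        self.props[vid] = dict(props)
        self.out_edges[vid] = []
        return vid

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))

    def type_of(self, vid):
        """Follow the first 'type' edge and return the type vertex's properties."""
        for label, target in self.out_edges[vid]:
            if label == "type":
                return self.props[target]
        return None


g = Graph()

# The type itself is a vertex, so its meaning is documented inside the graph.
special_user = g.add_vertex(
    name="SPECIAL_USER",
    created_by="alice",  # placeholder provenance value
    description="Users with early access to beta features",  # placeholder
)

bob = g.add_vertex(name="bob")
g.add_edge(bob, "type", special_user)

type_info = g.type_of(bob)
```

Anyone asking "What does SPECIAL_USER mean?" can then read `type_info["description"]` and `type_info["created_by"]` from the graph instead of hunting through application code.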

One problem with this approach is that the 'user' and 'product' vertices will have a lot of edges incident on them as your application scales. In other words, you are creating supernodes, which cause scaling issues. This is why Titan introduced the concept of a unidirectional edge. A unidirectional edge is like a link on the web: the starting vertex points to another vertex, but that vertex is unaware of the edge. Since you don't want to traverse from the 'user' vertex to all user vertices, you aren't losing anything, but you gain scalability and performance.
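The difference can be sketched with a toy adjacency store in Python. This is only a conceptual model under the assumption that a regular edge is recorded at both endpoints while a unidirectional edge is recorded only at its source; it does not reproduce Titan's actual storage layout or API, and all names here are hypothetical.

```python
from collections import defaultdict


class AdjacencyStore:
    """Toy edge store that tracks outgoing and incoming edges per vertex."""

    def __init__(self):
        self.outgoing = defaultdict(list)
        self.incoming = defaultdict(list)

    def add_edge(self, source, label, target, unidirectional=False):
        self.outgoing[source].append((label, target))
        if not unidirectional:
            # A regular edge is also visible from the target vertex.
            self.incoming[target].append((label, source))

    def degree(self, vertex):
        """Number of edges the vertex itself has to keep track of."""
        return len(self.outgoing.get(vertex, [])) + len(self.incoming.get(vertex, []))


def build(unidirectional, n_users=10_000):
    store = AdjacencyStore()
    for i in range(n_users):
        store.add_edge(f"user-{i}", "type", "USER_TYPE", unidirectional=unidirectional)
    return store


bidirectional = build(unidirectional=False)
one_way = build(unidirectional=True)

supernode_degree = bidirectional.degree("USER_TYPE")  # grows with the number of users
link_only_degree = one_way.degree("USER_TYPE")        # type vertex stays small
```

With regular edges the type vertex accumulates one entry per user, whereas with unidirectional edges each user still knows its type but the type vertex records nothing. The cost is that you can no longer traverse from the type vertex back to its users.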

answered Oct 26 '22 19:10

Matthias Broecheler
