Neo4j super node issue - fanning out pattern

Tags:

I'm new to the Graph Database scene, looking into Neo4j and learning Cypher, we're trying to model a graph database, it's a fairly simple one, we got users, and we got movies, users can VIEW movies, RATE movies, create playlists and playlists can HAVE movies.

The question is regarding the Super Node performance issue. And I will quote something from a very good book I am currently reading - Learning Neo4j by Rik Van Bruggen, so here it is:

A very interesting problem then occurs in datasets where some parts of the graph are all connected to the same node. This node, also referred to as a dense node or a supernode, becomes a real problem for graph traversals because the graph database management system will have to evaluate all of the connected relationships to that node in order to determine what the next step will be in the graph traversal.

The solution to this problem proposed in the book is to have a Meta node with 100 connections to it, and the 101th connection to be linked to a new Meta node that is linked to the previous Meta Node.

I have seen a blog post from the official Neo4j Blog saying that they will fix this problem in the upcoming future (the blog post is from January 2013) - http://neo4j.com/blog/2013-whats-coming-next-in-neo4j/

More exactly they say:

Another project we have planned around “bigger data” is to add some specific optimizations to handle traversals across densely-connected nodes, having very large numbers (millions) of relationships. (This problem is sometimes referred to as the “supernodes” problem.)

What are your opinions on this issue? Should we go with the Meta node fanning-out pattern or go with the basic relationship that every tutorial seem to be using? Any other suggestions?

837

asked Dec 19 '14 14:12

Eek

2 Answers

UPDATE - October 2020. This article is the best source on this topic, covering all aspects of super nodes

(my original answer below)

It's a good question. This isn't really an answer, but why shouldn't we be able to discuss this here? Technically I think I'm supposed to flag your question as "primarily opinion based" since you're explicitly soliciting opinions, but I think it's worth the discussion.

The boring but honest answer is that it always depends on your query patterns. Without knowing what kinds of queries you're going to issue against this data structure, there's really no way to know the "best" approach.

Supernodes are problems in other areas as well. Graph databases sometimes are very difficult to scale in some ways, because the data in them is hard to partition. If this were a relational database, we could partition vertically or horizontally. In a graph DB when you have supernodes, everything is "close" to everything else. (An Alaskan farmer likes Lady Gaga, so does a New York banker). Moreso than just graph traversal speed, supernodes are a big problem for all sorts of scalability.

Rik's suggestion boils down to encouraging you to create "sub-clusters" or "partitions" of the super-node. For certain query patterns, this might be a good idea, and I'm not knocking the idea, but I think hidden in here is the notion of a clustering strategy. How many meta nodes do you assign? How many max links per meta-node? How did you go about assigning this user to this meta node (and not some other)? Depending on your queries, those questions are going to be very hard to answer, hard to implement correctly, or both.

A different (but conceptually very similar) approach is to clone Lady Gaga about a thousand times, and duplicate her data and keep it in sync between nodes, then assert a bunch of "same as" relationships between the clones. This isn't that different than the "meta" approach, but it has the advantage that it copies Lady Gaga's data to the clone, and the "Meta" node isn't just a dumb placeholder for navigation. Most of the same problems apply though.

Here's a different suggestion though: you have a large-scale many-to-many mapping problem here. It's possible that if this is a really huge problem for you, you'd be better off breaking this out into a single relational table with two columns (from_id, to_id), each referencing a neo4j node ID. You then might have a hybrid system that's mostly graph (but with some exceptions). Lots of tradeoffs here; of course you couldn't traverse that rel in cypher at all, but it would scale and partition much better, and querying for a particular rel would probably be much faster.

One general observation here: whether we're talking about relational, graph, documents, K/V databases, or whatever -- when the databases get really big, and the performance requirements get really intense, it's almost inevitable that people end up with some kind of a hybrid solution with more than one kind of DBMS. This is because of the inescapable reality that all databases are good at some things, and not good at others. So if you need a system that's good at most everything, you're going to have to use more than one kind of database. :)

There is probably quite a bit neo4j can do to optimize in these cases, but it would seem to me that the system would need some kinds of hints on access patterns in order to do a really good job at that. Of the 2,000,000 relations present, how to the endpoints best cluster? Are older relationships more important than newer, or vice versa?

answered Oct 28 '22 06:10

FrobberOfBits

Re. the Neo4j blog, dense node support should be enhanced in Neo4j 2.1 (and above), see also http://neo4j.com/blog/neo4j-2-1-graph-etl/

answered Oct 28 '22 08:10

zazi

Related questions
                            
                                using-declaration in derived class does not hide same function derived from base class
                            
                                static_assert dependent on non-type template parameter (different behavior on gcc and clang)
                            
                                How to write log messages to file using Spring Boot?
                            
                                How can I find the size of a RDD
                            
                                Behavior of Assembly.GetTypes() changed in Visual Studio 2015
                            
                                Getting the accuracy for multi-label prediction in scikit-learn
                            
                                How to get all process ids without ps command on Linux
                            
                                What does $$ and <generated> mean in Java stacktrace?
                            
                                Android Google SignIn not working in debug mode: GoogleSignInResult is false
                            
                                Generator is not an iterator?
                            
                                How can I use decorators today?
                            
                                What is node_modules directory in AngularJS?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With