Neo4j Cypher query to find nodes that are not connected too slow

Tags:

Given we have the following Neo4j schema (simplified but it shows the important point). There are two types of nodes NODE and VERSION. VERSIONs are connected to NODEs via a VERSION_OF relationship. VERSION nodes do have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited. NODEs can be connected via a HAS_CHILD relationship. Again these relationships have two properties from and until that denote the validity timespan - either or both can be NULL (nonexistent in Neo4j terms) to denote unlimited.

EDIT: The validity dates on VERSION nodes and HAS_CHILD relations are independent (even though the example coincidentally shows them being aligned).

enter image description here

The example shows two NODEs A and B. A has two VERSIONs AV1 until 6/30/17 and AV2 starting from 7/1/17 while B only has one version BV1 that is unlimited. B is connected to A via a HAS_CHILD relationship until 6/30/17.

The challenge now is to query the graph for all nodes that aren't a child (that are root nodes) at one specific moment in time. Given the example above, the query should return just B if the query date is e.g. 6/1/17, but it should return B and A if the query date is e.g. 8/1/17 (because A isn't a child of B as of 7/1/17 any more).

The current query today is roughly similar to that one:

MATCH (n1:NODE)
OPTIONAL MATCH (n1)<-[c]-(n2:NODE), (n2)<-[:VERSION_OF]-(nv2:ITEM_VERSION)
WHERE (c.from <= {date} <= c.until)
AND (nv2.from <= {date} <= nv2.until)
WITH n1 WHERE c IS NULL 
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
RETURN n1, nv1 
ORDER BY toLower(nv1.title) ASC 
SKIP 0 LIMIT 15

This query works relatively fine in general but it starts getting slow as hell when used on large datasets (comparable to real production datasets). With 20-30k NODEs (and about twice the number of VERSIONs) the (real) query takes roughly 500-700 ms on a small docker container running on Mac OS X) which is acceptable. But with 1.5M NODEs (and about twice the number of VERSIONs) the (real) query takes a little more than 1 minute on a bare-metal server (running nothing else than Neo4j). This is not really acceptable.

Do we have any option to tune this query? Are there better ways to handle the versioning of NODEs (which I doubt is the performance problem here) or the validity of relationships? I know that relationship properties cannot be indexed, so there might be a better schema for handling the validity of these relationships.

Any help or even the slightest hint is greatly appreciated.

EDIT after answer from Michael Hunger:

Percentage of root nodes:

With the current example data set (1.5M nodes) the result set contains about 2k rows. That's less than 1%.

ITEM_VERSION node in first MATCH:

We're using the ITEM_VERSION nv2 to filter the result set to ITEM nodes that have no connection other ITEM nodes at the given date. That means that either no relationship must exist that is valid for the given date or the connected item must not have an ITEM_VERSION that is valid for the given date. I'm trying to illustrate this:

// date 6/1/17

// n1 returned because relationship not valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...6/30/17]->(n2)<-(nv2 ...)

// n1 not returned because relationship and connected item n2 valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...)

// n1 returned because connected item n2 not valid even though relationship is valid
(nv1 ...)->(n1)-[X_HAS_CHILD ...]->(n2)<-(nv2 ...6/30/17)

No use of relationship-types:

The problem here is that the software features a user-defined schema and ITEM nodes are connected by custom relationship-types. As we can't have multiple types/labels on a relationship the only common characteristic for these kind of relationships is that they all start with X_. That's been left out of the simplified example here. Would searching with the predicate type(r) STARTS WITH 'X_' help here?

616

asked Dec 23 '16 11:12

Stefan Gehrig

1 Answers

What Neo4j version are you using.

What percentage of your 1.5M nodes will be found as roots at your example date, and if you don't have the limit how much data comes back? Perhaps the issue is not in the match so much as in the sorting at the end?

I'm not sure why you had the VERSION nodes in your first part, at least you don't describe them as relevant for determining a root node.

You didn't use relationship-types.

MATCH (n1:NODE) // matches 1.5M nodes
// has to do 1.5M * degree optional matches
OPTIONAL MATCH (n1)<-[c:HAS_CHILD]-(n2) WHERE (c.from <= {date} <= c.until)
WITH n1 WHERE c IS NULL
// how many root nodes are left?
// # root nodes * version degree (1..2)
MATCH (n1)<-[:VERSION_OF]-(nv1:ITEM_VERSION)
WHERE nv1.from <= {date} <= nv1.until
// has to sort all those
WITH n1, nv1, toLower(nv1.title) as title
RETURN n1, nv1
ORDER BY title ASC 
SKIP 0 LIMIT 15

167

answered Oct 14 '22 16:10

Michael Hunger

Related questions
                            
                                Generating synthetic social networks?
                            
                                List of problems that are in general NP-hard but have polynomial-time solution in planar graphs?
                            
                                Java Library to generate interactive Graph
                            
                                library to detect overlapping communities in a network?
                            
                                Split matrix into 4 sub-matrices with lowest difference between their sum
                            
                                Shortest path that traverses a list of required edges
                            
                                Card Algorithm Game
                            
                                ArangoDB - how to implement a custom recommendation engine using graph?
                            
                                Time Aware Social Graph DS/Queries
                            
                                C Directed Graph Implementation Choice
                            
                                Graphviz can't find any fonts
                            
                                SAS: How do I use ODS layout to put 8 graphs on 2 PDF pages?
                            
                                Dynamically updating shortest paths
                            
                                Boost Graph Library: Potential Bug
                            
                                Changing alpha doesn't affect anything in ggplot2
                            
                                Graphing Algorithm for many nodes
                            
                                how to build a graph from tuples in graphx and label the nodes after ?
                            
                                Modified Shortest Path Algorithm (Dijkstra's)
                            
                                How to visualize (dendrogram) a dictionary of hierarchical items?
                            
                                Render HTML on X-Axis Ticks

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Neo4j Cypher query to find nodes that are not connected too slow

Tags:

graph

neo4j

cypher

Stefan Gehrig

People also ask

1 Answers

Michael Hunger

Recent Activity

Donate For Us