
How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

I am currently trying to merge three datasets for analysis purposes. I am using certain common fields to establish the connections between the datasets. In order to create the connections I have tried using the following type of query:

MATCH (p1:Person),(p2:Person)
WHERE p1.email = p2.email AND p1.name = p2.name AND p1 <> p2 
CREATE UNIQUE (p1)-[:IS]-(p2);

Which can be similarly written as:

MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2 
CREATE UNIQUE (p1)-[:IS]-(p2);

Needless to say, this is a very slow query on a database with about 100,000 Person nodes, especially given that Neo4j does not process a single query in parallel.

Now, my question is whether there is a better way to run such queries in Neo4j. I have at least eight CPU cores to dedicate to Neo4j, as long as separate threads don't block each other by locking the resources the others need.

The issue is that I don't know how Neo4j builds its Cypher execution plans. For instance, let's say I run the following test query:

MATCH (p1:Person),(p2:Person {name:p1.name, email:p1.email})
WHERE p1 <> p2 
RETURN p1, p2
LIMIT 100;

Despite the LIMIT clause, Neo4j still takes a considerable amount of time to return the results, which makes me wonder whether, even for such a limited query, Neo4j produces the whole Cartesian product before applying the LIMIT clause.

I appreciate any help, whether it addresses this specific issue or just gives me an understanding of how Neo4j generally builds Cypher execution plans (and thus how to optimize queries in general). Can legacy Lucene indexes be of any help here?

retrography asked Jun 26 '14 18:06


1 Answer

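You can avoid the Cartesian product by combining a label scan for p1 with an index lookup on name for p2, followed by a comparison of the email property. With 100,000 generated Person nodes and an index on :Person(name), it looks like this in the Neo4j shell: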

cypher 2.1 
foreach (i in range(1,100000) | 
  create (:Person {name:"John Doe"+str(i % 10000),
                   email:"john"+str(i % 10000)+"@doe.com"}));
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 100000
Properties set: 200000
Labels added: 100000
6543 ms
neo4j-sh (?)$ CREATE INDEX ON :Person(name);
+-------------------+
| No data returned. |
+-------------------+
Indexes added: 1
28 ms

neo4j-sh (?)$ schema
Indexes
  ON :Person(name)  ONLINE

neo4j-sh (?)$ 
match (p1:Person) with p1 
match (p2:Person {name:p1.name}) using index p2:Person(name) 
where p1<>p2 AND p2.email = p1.email 
return count(*);
+----------+
| count(*) |
+----------+
| 900000   |
+----------+
1 row
8206 ms

neo4j-sh (?)$ 
match (p1:Person) with p1 
match (p2:Person {name:p1.name}) using index p2:Person(name) 
where p1<>p2 AND p2.email = p1.email
merge (p1)-[:IS]-(p2) 
return count(*);

+----------+
| count(*) |
+----------+
| 900000   |
+----------+
1 row
Relationships created: 450000
40256 ms
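
To see why the counts come out as 900000 rows and 450000 relationships, here is a minimal Python sketch that imitates the plan above in memory. It is not how Neo4j executes the query internally: the dict `by_name` merely stands in for the index on :Person(name), and the helper names `build_people` and `match_pairs` are made up for this illustration.

```python
from collections import defaultdict

def build_people(n=100_000, groups=10_000):
    # Same test data as the foreach statement: names and emails repeat every `groups` nodes.
    return [
        {"id": i, "name": f"John Doe{i % groups}", "email": f"john{i % groups}@doe.com"}
        for i in range(1, n + 1)
    ]

def match_pairs(people):
    # Stand-in for the :Person(name) index (an assumption of this sketch).
    by_name = defaultdict(list)
    for p in people:
        by_name[p["name"]].append(p)

    rows = 0
    rels = set()
    for p1 in people:                   # "label scan" over all Person nodes
        for p2 in by_name[p1["name"]]:  # "index lookup" instead of scanning everyone
            if p1["id"] != p2["id"] and p1["email"] == p2["email"]:
                rows += 1
                # An undirected relationship is the same for (p1, p2) and (p2, p1).
                rels.add((min(p1["id"], p2["id"]), max(p1["id"], p2["id"])))
    return rows, len(rels)

rows, rels = match_pairs(build_people())
print(rows, rels)
```

Each of the 100,000 nodes shares its name and email with 9 others, so the inner loop inspects about 1,000,000 candidates rather than the 10,000,000,000 of a full Cartesian product. Every matching pair is seen once from each side, which is why the row count is twice the number of relationships.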

Michael Hunger answered Sep 18 '22 10:09