I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome) WHERE g.chromosomeID = c.chromosomeID CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
The browser is telling you two things.
First, it is handling your query by comparing every Gene instance with every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 46 × 25000 = 1150000 comparisons.
Second, you might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.
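As a rough sketch, the index on :Gene(chromosomeID) could be created as shown here. The exact syntax depends on your Neo4j version, and the index name gene_chromosome_id is a hypothetical choice, not something from the question.

```cypher
// Neo4j 4.x and later: named index on the Gene.chromosomeID property.
// The name gene_chromosome_id is an assumption made for this example.
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);

// Older 3.x releases used this form instead:
// CREATE INDEX ON :Gene(chromosomeID);
```

The index is populated in the background, so on a large import it may take a moment before the planner can use it.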
Once we have created the index, we can use this query:
MATCH (c:Chromosome) WITH c MATCH (g:Gene) WHERE g.chromosomeID = c.chromosomeID CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
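One way to make that comparison is sketched below, using the two queries from this thread. Note that PROFILE actually executes the query, so profiling a CREATE would add the relationships (and add them again on a second run). EXPLAIN only shows the planned operators without running anything, which makes it the safer first check.

```cypher
// Version without WITH: check whether the plan contains a CartesianProduct operator.
EXPLAIN
MATCH (g:Gene), (c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

// Version with WITH: compare its operators against the plan above.
EXPLAIN
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
```

Run each statement separately and compare the operator trees. If you want actual row counts and database hits, swap EXPLAIN for PROFILE on a test copy of the data.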
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK, since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you were to MATCH, e.g., genes against proteins, the query might blow up.
I think the warning is intended for other problematic queries:
If you MATCH a cartesian product but you don't know whether there is a relationship, you could use OPTIONAL MATCH.
If you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query.
In case your query takes too long or does not finish, here is another question giving some hints on how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)