Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Neo4j use MERGE with null values

I know this question has been asked several times before but the answers didn't solve my problem. I am trying to execute this query:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1 {address_name1:line1.address1})

but I get the error: Cannot merge node using null property value for address_name1.

Other people suggested using:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1)
 ON CREATE SET a.address_name1=line1.address1
 ON MATCH SET a.address_name1=line1.address1

but this solution works if the node has more than one property. In my case it has only address_name1 property.

Is there a way to work around this problem, like replacing the null values with a word in the query before the MERGE, or some other solution?

like image 751
Porjaz Avatar asked Nov 24 '16 14:11

Porjaz


People also ask

How does merge work in Neo4j?

What is MERGE, and how does it work? The MERGE clause ensures that a pattern exists in the graph. Either the entire pattern already exists, or the entire pattern needs to be created. In this way, it's helpful to think of MERGE as attempting a MATCH on the pattern, and if no match is found, a CREATE of the pattern.

What is coalesce in Neo4j?

The function coalesce() returns the first non- null value in the given list of expressions. Syntax: coalesce(expression [, expression]*) Returns: The type of the value returned will be that of the first non- null expression.

How do I merge two nodes in Neo4j?

You can merge a node in the database based on the label using the MERGE clause. If you try to merge a node based on the label, then Neo4j verifies whether there exists any node with the given label. If not, the current node will be created.


2 Answers

If there's no address, do you really need to create the Address node at all?

You can filter the lines from the CSV using WITH/WHERE:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
WITH line1
WHERE NOT line1.address1 IS NULL
MERGE (a:Address1 {address_name1:line1.address1})

Otherwise, if you want to create a node representing the "unknown" address, you can use coalesce() to substitute a default value:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1 {address_name1: coalesce(line1.address1, "Unknown")})
like image 161
Frank Pavageau Avatar answered Oct 21 '22 16:10

Frank Pavageau


Hello: I am posting this rather extensive answer as I recently encountered surprising difficulty dealing with NULL (missing) values present in my CSV files, when attempting to load those data into Neo4j (neo4j 3.3.4).

I present three solutions.

I am using the Cycli (cycli 0.7.6) CLI, installed via pip in a Python 3.5 venv on an Arch Linux x86_64 system.

My CSV file (glycolysis_metabolites.csv) is:

name,abbreviation,kegg_entry
α-D-glucose,GLC,C00267
glucose 6-phosphate,G6P,C00668
fructose 6-phosphate,F6P,C05345
"fructose 1,6-bisphosphate",FBP,C05378
dihydroxyacetone phosphate,DHAP,C00111
D-glyceraldehyde 3-phosphate,,C00118
"1,3-bisphosphoglycerate","1,3-BPG",C00236
3-phosphoglycerate,3PG,C00197
2-phosphoglycerate,2PG,C00631
phosphoenolpyruvate,PEP,C00074
pyruvate,,C00022

Those data, copied from a PostgreSQL table via the psql /COPY ... command, have a "UNIQUE NOT NULL" constraint on the "name" field.

After surveying Google etc. I conducted three Experiments, below. Experiments 2 and 3 are basically the same.

I believe that the approach shown in Experiment 2 is the best solution, as the COALESCE statements are included within the MERGE statement.

My reason for this conclusion is that Experiment 2 uses "local" variables, rather than returning "global" variables (Experiment 3), thus minimizing unintended consequences on reused variable names.

I load my Cypher script as follows:

cat glycolysis_script.cypher |  cypher-shell -u victoria -p <your_password>

** EXPERIMENT 1**

Reference: http://markhneedham.com/blog/2014/08/22/neo4j-load-csv-handling-empty-columns/

This solution (Mark Needham's) is pretty clever: it creates nodes containing all non-NULL properties, e.g.

<id>: 0 abbreviation: GLC kegg_entry: C00267 name: α-D-glucose <id>: 10 kegg_entry: C00022 name: pyruvate

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
MERGE (a:GlycolysisMetabolites {name: row.name})
FOREACH(ignoreMe IN CASE WHEN row.abbreviation <> "" THEN [1] ELSE [] END | SET a.abbreviation = row.abbreviation)
FOREACH(ignoreMe IN CASE WHEN row.kegg_entry <> "" THEN [1] ELSE [] END | SET a.kegg_entry = row.kegg_entry)
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;

OUTPUT:

$ cat glycolysis.cypher |  cypher-shell -u victoria -p <your_password>

a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"

However, you cannot set your own MERGE specifications, on properties that include NULL values (here: "abbreviation") -- the reason being that you cannot MERGE on NULL property values.

Works:

MERGE (a:GlycolysisMetabolites {name: row.name})

Fails ("Cannot merge node using null property value for abbreviation"):

MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation, kegg_entry:row.kegg_entry})

EXPERIMENT 2

Reference: Neo4j use MERGE with null values

Here, I set an empty string ('') as a replacement for the NULL values present in the CSV file; you can use whatever you want; e.g.: 'Undefined', 'null', ...

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
// MERGE (a:GlycolysisMetabolites {name: row.name})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:COALESCE(row.abbreviation, ''), kegg_entry:COALESCE(row.kegg_entry, '')})
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;

OUTPUT:

$ cat glycolysis.cypher |  cypher-shell -u victoria -p <your_password>

a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"

EXPERIMENT 3

References:

Neo4j use MERGE with null values

https://github.com/neo4j/neo4j/issues/2521

This also works, but as the COALESCE statements are outside the MERGE statement, I worry that the data returned by the RETURN statement could cause problems, if those variable names are reused elsewhere. As a workaround, I added a prefix (a_) as a quasi-UID, but I think the solution in Experiment 2, above, is the better approach.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
WITH
  COALESCE(CASE row.name WHEN '' THEN null ELSE row.name END, '') AS a_name,
  COALESCE(CASE row.abbreviation WHEN '' THEN null ELSE row.abbreviation END, '') AS a_abbreviation,
  COALESCE(CASE row.kegg_entry WHEN '' THEN null ELSE row.kegg_entry END, '') AS a_kegg_entry
MERGE (a:GlycolysisMetabolites {name:a_name, abbreviation:a_abbreviation, kegg_entry:a_kegg_entry})
// Note: RETURN can only be used at the end of the query
RETURN a_name, a_abbreviation, a_kegg_entry;

OUTPUT:

$ cat glycolysis.cypher |  cypher-shell -u victoria -p <your_password>

a_name, a_abbreviation, a_kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"

Additional StackOverflow discussion on this topic/issue: https://stackoverflow.com/search?tab=votes&q=Neo4j%20use%20MERGE%20with%20null%20value


Addendum

Reference (e.g.): Neo4j CSV file load with empty cells

This "works", but SKIPS the creation of a node if any of the fields contains NULL values:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
FOREACH (
    x IN CASE WHEN row.abbreviation IS NULL OR row.kegg_entry IS NULL THEN [] ELSE [1] END |
    MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation: row.abbreviation, kegg_entry: row.kegg_entry})
    )
RETURN row.name, row.abbreviation, row.kegg_entry;

OUTPUT:

$ cat glycolysis.cypher |  cypher-shell -u victoria -p <password>

row.name, row.abbreviation, row.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"

Note that in the Neo4j Browser, only 9 (not 11) nodes are created: nodes for "D-glyceraldehyde 3-phosphate" and "pyruvate" are not created.

like image 25
Victoria Stuart Avatar answered Oct 21 '22 15:10

Victoria Stuart