I know this question has been asked several times before but the answers didn't solve my problem. I am trying to execute this query:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1 {address_name1:line1.address1})
but I get the error: Cannot merge node using null property value for address_name1
.
Other people suggested using:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1)
ON CREATE SET a.address_name1=line1.address1
ON MATCH SET a.address_name1=line1.address1
but this solution works if the node has more than one property. In my case it has only address_name1
property.
Is there a way to work around this problem, like replacing the null values with a word in the query before the MERGE
, or some other solution?
What is MERGE, and how does it work? The MERGE clause ensures that a pattern exists in the graph. Either the entire pattern already exists, or the entire pattern needs to be created. In this way, it's helpful to think of MERGE as attempting a MATCH on the pattern, and if no match is found, a CREATE of the pattern.
The function coalesce() returns the first non- null value in the given list of expressions. Syntax: coalesce(expression [, expression]*) Returns: The type of the value returned will be that of the first non- null expression.
You can merge a node in the database based on the label using the MERGE clause. If you try to merge a node based on the label, then Neo4j verifies whether there exists any node with the given label. If not, the current node will be created.
If there's no address, do you really need to create the Address
node at all?
You can filter the lines from the CSV using WITH
/WHERE
:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
WITH line1
WHERE NOT line1.address1 IS NULL
MERGE (a:Address1 {address_name1:line1.address1})
Otherwise, if you want to create a node representing the "unknown" address, you can use coalesce()
to substitute a default value:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///C:/Users/Zona5/Documents/Neo4j/checkIntel/import/personaldata.csv' AS line1
MERGE (a:Address1 {address_name1: coalesce(line1.address1, "Unknown")})
Hello: I am posting this rather extensive answer as I recently encountered surprising difficulty dealing with NULL (missing) values present in my CSV files, when attempting to load those data into Neo4j (neo4j 3.3.4).
I present three solutions.
I am using the Cycli (cycli 0.7.6) CLI, installed via pip in a Python 3.5 venv on an Arch Linux x86_64 system.
My CSV file (glycolysis_metabolites.csv) is:
name,abbreviation,kegg_entry
α-D-glucose,GLC,C00267
glucose 6-phosphate,G6P,C00668
fructose 6-phosphate,F6P,C05345
"fructose 1,6-bisphosphate",FBP,C05378
dihydroxyacetone phosphate,DHAP,C00111
D-glyceraldehyde 3-phosphate,,C00118
"1,3-bisphosphoglycerate","1,3-BPG",C00236
3-phosphoglycerate,3PG,C00197
2-phosphoglycerate,2PG,C00631
phosphoenolpyruvate,PEP,C00074
pyruvate,,C00022
Those data, copied from a PostgreSQL table via the psql /COPY ... command, have a "UNIQUE NOT NULL" constraint on the "name" field.
After surveying Google etc. I conducted three Experiments, below. Experiments 2 and 3 are basically the same.
I believe that the approach shown in Experiment 2 is the best solution, as the COALESCE statements are included within the MERGE statement.
My reason for this conclusion is that Experiment 2 uses "local" variables, rather than returning "global" variables (Experiment 3), thus minimizing unintended consequences on reused variable names.
I load my Cypher script as follows:
cat glycolysis_script.cypher | cypher-shell -u victoria -p <your_password>
** EXPERIMENT 1**
Reference: http://markhneedham.com/blog/2014/08/22/neo4j-load-csv-handling-empty-columns/
This solution (Mark Needham's) is pretty clever: it creates nodes containing all non-NULL properties, e.g.
<id>: 0 abbreviation: GLC kegg_entry: C00267 name: α-D-glucose
<id>: 10 kegg_entry: C00022 name: pyruvate
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
MERGE (a:GlycolysisMetabolites {name: row.name})
FOREACH(ignoreMe IN CASE WHEN row.abbreviation <> "" THEN [1] ELSE [] END | SET a.abbreviation = row.abbreviation)
FOREACH(ignoreMe IN CASE WHEN row.kegg_entry <> "" THEN [1] ELSE [] END | SET a.kegg_entry = row.kegg_entry)
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"
However, you cannot set your own MERGE specifications, on properties that include NULL values (here: "abbreviation") -- the reason being that you cannot MERGE on NULL property values.
Works:
MERGE (a:GlycolysisMetabolites {name: row.name})
Fails ("Cannot merge node using null property value for abbreviation"):
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:row.abbreviation, kegg_entry:row.kegg_entry})
EXPERIMENT 2
Reference: Neo4j use MERGE with null values
Here, I set an empty string ('') as a replacement for the NULL values present in the CSV file; you can use whatever you want; e.g.: 'Undefined', 'null', ...
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
// MERGE (a:GlycolysisMetabolites {name: row.name})
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation:COALESCE(row.abbreviation, ''), kegg_entry:COALESCE(row.kegg_entry, '')})
// With "USING PERIODIC COMMIT",
// RETURN a;
// throws this error: "Unknown value type: STRUCT"
// ... so, use this:
RETURN a.name, a.abbreviation, a.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a.name, a.abbreviation, a.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"
EXPERIMENT 3
References:
Neo4j use MERGE with null values
https://github.com/neo4j/neo4j/issues/2521
This also works, but as the COALESCE statements are outside the MERGE statement, I worry that the data returned by the RETURN statement could cause problems, if those variable names are reused elsewhere. As a workaround, I added a prefix (a_) as a quasi-UID, but I think the solution in Experiment 2, above, is the better approach.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
WITH
COALESCE(CASE row.name WHEN '' THEN null ELSE row.name END, '') AS a_name,
COALESCE(CASE row.abbreviation WHEN '' THEN null ELSE row.abbreviation END, '') AS a_abbreviation,
COALESCE(CASE row.kegg_entry WHEN '' THEN null ELSE row.kegg_entry END, '') AS a_kegg_entry
MERGE (a:GlycolysisMetabolites {name:a_name, abbreviation:a_abbreviation, kegg_entry:a_kegg_entry})
// Note: RETURN can only be used at the end of the query
RETURN a_name, a_abbreviation, a_kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <your_password>
a_name, a_abbreviation, a_kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", "", "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", "", "C00022"
Additional StackOverflow discussion on this topic/issue: https://stackoverflow.com/search?tab=votes&q=Neo4j%20use%20MERGE%20with%20null%20value
Addendum
Reference (e.g.): Neo4j CSV file load with empty cells
This "works", but SKIPS the creation of a node if any of the fields contains NULL values:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/mnt/Vancouver/Programming/data/metabolism/pg2neo4j/glycolysis_metabolites.csv" AS row
FOREACH (
x IN CASE WHEN row.abbreviation IS NULL OR row.kegg_entry IS NULL THEN [] ELSE [1] END |
MERGE (a:GlycolysisMetabolites {name: row.name, abbreviation: row.abbreviation, kegg_entry: row.kegg_entry})
)
RETURN row.name, row.abbreviation, row.kegg_entry;
OUTPUT:
$ cat glycolysis.cypher | cypher-shell -u victoria -p <password>
row.name, row.abbreviation, row.kegg_entry
"α-D-glucose", "GLC", "C00267"
"glucose 6-phosphate", "G6P", "C00668"
"fructose 6-phosphate", "F6P", "C05345"
"fructose 1,6-bisphosphate", "FBP", "C05378"
"dihydroxyacetone phosphate", "DHAP", "C00111"
"D-glyceraldehyde 3-phosphate", NULL, "C00118"
"1,3-bisphosphoglycerate", "1,3-BPG", "C00236"
"3-phosphoglycerate", "3PG", "C00197"
"2-phosphoglycerate", "2PG", "C00631"
"phosphoenolpyruvate", "PEP", "C00074"
"pyruvate", NULL, "C00022"
Note that in the Neo4j Browser, only 9 (not 11) nodes are created: nodes for "D-glyceraldehyde 3-phosphate" and "pyruvate" are not created.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With