Neo4j's MERGE command on big datasets

Currently, I am working on a project implementing a Neo4j (v2.2.0) database for web analytics. After loading some samples, I am trying to load a large data set (>1 GB, >4M lines). The problem I am facing is that MERGE takes progressively longer as the data size grows. Online sources are ambiguous about the best way to load large data sets when not every line has to become a node, and I would like some clarity on the subject. To emphasize: at this stage I am only loading the nodes; relationships are the next step.

Basically, there are three methods:

i) Set a uniqueness constraint for a property, and create all nodes. This method was used mainly before the MERGE command was introduced.

CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE

followed by

USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
CREATE (:Book {isbn: row.isbn, title: row.title, etc})

In my experience, this will return an error if a duplicate is found, which stops the query.
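For reference, a read-only pass can list duplicate ISBNs before loading; a minimal sketch, assuming the same file and column layout as above (no PERIODIC COMMIT needed since nothing is written):

LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
// group by isbn and keep only values that occur more than once
WITH row.isbn AS isbn, count(*) AS occurrences
WHERE occurrences > 1
RETURN isbn, occurrences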

ii) Merging the nodes with all their properties.

USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (:Book {isbn: row.isbn, title: row.title, etc})

I have tried loading my set in this manner, but after letting the process run for over 36 hours and coming to a grinding halt, I figured there should be a better alternative, as ~200K of my eventual ~750K nodes were loaded.
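One way to see why this grinds to a halt is to PROFILE a single MERGE; without an index on :Book(isbn), the plan falls back to scanning every existing :Book node for each row, so each commit batch gets slower than the last. A minimal sketch with a made-up ISBN (note that running it will create the node if it does not already exist):

PROFILE
MERGE (b:Book {isbn: "978-0000000000", title: "placeholder"})
RETURN b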

iii) Merging nodes based on one property, and setting the rest after that.

USING PERIODIC COMMIT 250
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author
etc

I am running a test now (~20K nodes) to see whether switching from method ii to method iii improves execution time, as a smaller sample gave conflicting results. Are there methods I am overlooking that could improve execution time? If I am not mistaken, the batch inserter only works with the CREATE command, not with MERGE.

I have allowed Neo4j to use 4GB of RAM, and judging by my task manager this is enough (it uses just over 3GB).
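For reference, in Neo4j 2.2 the JVM heap is set in conf/neo4j-wrapper.conf and the page cache in conf/neo4j.properties; a sketch assuming the default file layout (the exact values here are only illustrative):

# conf/neo4j-wrapper.conf
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096

# conf/neo4j.properties
dbms.pagecache.memory=2g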

asked by Michiel van Zummeren

1 Answer

Method iii) should be the fastest solution, since you MERGE on a single property. Do you create the uniqueness constraint before you do the MERGE? Without an index (a constraint or a normal index), every MERGE has to scan all existing :Book nodes, so the process slows down as the number of nodes grows.

CREATE CONSTRAINT ON (book:Book) ASSERT book.isbn IS UNIQUE

Followed by:

USING PERIODIC COMMIT 20000
LOAD CSV WITH HEADERS FROM "file:C:\\path\\file.tsv" AS row FIELDTERMINATOR '\t'
MERGE (b:Book {isbn: row.isbn})
ON CREATE SET b.title = row.title
ON CREATE SET b.author = row.author

This should work, and you can experiment with larger PERIODIC COMMIT sizes.
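If the load is started immediately after creating the constraint, it may also help to wait until the index backing it is online; in the neo4j-shell (assuming the 2.x shell commands) that is:

schema await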

I can add a few hundred thousand nodes within minutes this way.
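Once the load has finished, a quick sanity check that all expected nodes arrived:

MATCH (b:Book)
RETURN count(b) AS books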

answered by Martin Preusse

