
What is the most efficient way to insert nodes into a neo4j database using cypher

I'm trying to insert a large number of nodes (~500,000) into a (non-embedded) neo4j database by executing cypher commands using the py2neo python module (py2neo.cypher.execute). Eventually I need to remove the dependence on py2neo, but I'm using it at the moment until I learn more about cypher and neo4j.

I have two node types A and B, and the vast majority of nodes are of type A. There are two possible relationships r1 and r2, such that A-[r1]-A and A-[r2]-B. Each node of type A will have 0 - 100 r1 relationships, and each node of type B will have 1 - 5000 r2 relationships.

At the moment I am inserting nodes by building up large CREATE statements. For example I might have a statement

CREATE (:A {uid:1, attr:5})-[:r1]->(:A {uid:2, attr:5})-[:r1]->...

where ... might be another 5000 or so nodes and relationships forming a linear chain in the graph. This works okay, but it's pretty slow. I'm also indexing these nodes using

CREATE INDEX ON :A(uid)

After I've added all the type A nodes, I add the type B nodes using CREATE statements again. Finally, I am trying to add the r2 relationships using a statement like

MATCH (c:B), (m:A) WHERE c.uid=1 AND (m.uid=2 OR m.uid=5 OR ...)
CREATE (m)-[:r2]->(c)

where ... could represent a few thousand OR clauses. This seems really slow, adding only a few relationships per second.

So, is there a better way to do this? Am I completely off track here? I looked at this question, but it doesn't explain how to use cypher to efficiently load the nodes. Everything else I look at seems to use java, without showing the actual cypher queries that could be used.

savagent asked Jun 06 '13 00:06




1 Answer

Don't create the index (in 2.0) until the end, after all the nodes are loaded. Maintaining it during the load will slow down node creation.

Are you using parameters in your Cypher?

I imagine you're losing a lot of cypher parsing time unless your cypher is exactly the same each time with parameters. If you can model it to be that, you'll see a marked performance increase.
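As a rough sketch of what "exactly the same each time with parameters" could look like from Python: the query text stays constant and only the parameter map changes. The `{uid}` placeholder is the Neo4j 2.0-era parameter syntax (newer releases use `$uid`). The helper name `node_statement` and the sample values are illustrative assumptions, and nothing is sent to a server here.

```python
# One constant query string. Only the parameter map varies between calls,
# so the server can reuse the parsed query instead of re-parsing literals.
CREATE_A = "CREATE (n:A {uid: {uid}, attr: {attr}})"


def node_statement(uid, attr):
    """Pair the fixed query text with per-node parameters (hypothetical helper)."""
    return CREATE_A, {"uid": uid, "attr": attr}


# Three nodes, three parameter maps, one query text.
statements = [node_statement(uid, 5) for uid in range(1, 4)]

# Collect the distinct query texts to confirm they are all identical.
distinct_queries = {query for query, _ in statements}
print(len(distinct_queries), "distinct query text(s) for", len(statements), "statements")
```

Compare this with building `{uid:1, attr:5}` into the query string itself, where every statement is a different string that has to be parsed from scratch.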

You're already sending fairly hefty chunks in your cypher request, but the batch request API will let you send more than one in one REST request, which might be faster (try it!).
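For illustration, here is roughly how several Cypher calls might be packed into one batch request body. This assumes the legacy REST batch format (a JSON list of jobs with `method`, `to`, `body` and `id`, posted to `/db/data/batch`) and the legacy `/cypher` endpoint taking `query` and `params`; check both against the docs for your server version. The function name `batch_jobs` is made up, and the sketch only builds the payload without sending it.

```python
import json


def batch_jobs(query, param_sets):
    """Build one batch job per parameter set (hypothetical helper).

    Assumes the legacy batch job shape: method, to, body, id.
    """
    return [
        {
            "method": "POST",
            "to": "/cypher",  # assumed path, relative to /db/data
            "body": {"query": query, "params": params},
            "id": job_id,
        }
        for job_id, params in enumerate(param_sets)
    ]


jobs = batch_jobs(
    "CREATE (n:A {uid: {uid}, attr: {attr}})",
    [{"uid": 1, "attr": 5}, {"uid": 2, "attr": 5}],
)

# This string would be the body of a single POST to /db/data/batch.
payload = json.dumps(jobs)
print(len(jobs), "jobs in one request body")
```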

Finally, if this is a one-time import, you might consider using the batch-import tool. It can burn through 500K nodes in a few minutes even on bad hardware. You can then upgrade the database files (I don't think it can create 2.0 files yet, but that may be coming shortly) and create your labels and index via Cypher.

Update: I just noticed your MATCH statement at the end. You shouldn't do it this way. Create one relationship at a time instead of using OR for the ids, and make sure you use parameters for the uids. This will probably help a lot. Cypher 2.0 doesn't seem to be able to do index lookups with OR, even when you use an index hint. Maybe this will come later.
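A minimal sketch of the one-relationship-at-a-time idea, using the labels and the r2 relationship from the question. Each statement looks up exactly one B node and one A node by uid, so there is no OR chain. The parameter names `b_uid` and `a_uid` and the helper `r2_statements` are assumptions for illustration, `{name}` is the 2.0-era parameter syntax, and the code only builds the statements.

```python
# Fixed query text: one equality lookup per node, no OR chain.
R2_QUERY = (
    "MATCH (c:B), (m:A) "
    "WHERE c.uid = {b_uid} AND m.uid = {a_uid} "
    "CREATE (m)-[:r2]->(c)"
)


def r2_statements(b_uid, a_uids):
    """One (query, parameters) pair per relationship (hypothetical helper)."""
    return [(R2_QUERY, {"b_uid": b_uid, "a_uid": a_uid}) for a_uid in a_uids]


# Instead of "m.uid=2 OR m.uid=5 OR ...", emit one small statement per A node.
stmts = r2_statements(1, [2, 5, 9])
for query, params in stmts:
    print(params)
```

The statements can then be sent in bulk rather than one HTTP request each, so the extra statement count does not have to mean extra round trips.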

Update Dec 2013: 2.0 has the Cypher transactional endpoint, which I've seen great throughput improvements on. I've been able to send 20-30k Cypher statements/second, using "exec" sizes of 100-200 statements, and transaction sizes of 1000-10000 statements total. Very effective for speeding up loading over Cypher.
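To make the "exec" sizes concrete, here is a sketch that splits a list of parameterized statements into request bodies of up to 200 statements each. It assumes the transactional endpoint's JSON shape, `{"statements": [{"statement": ..., "parameters": ...}]}`. In the 2.0 REST API the first body is posted to `/db/data/transaction`, later ones to the transaction URL the server returns, and the commit goes to that URL plus `/commit`; treat those paths as something to verify for your version. The helper names are made up and nothing is actually sent.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def transaction_payloads(statements, exec_size=200):
    """Yield one JSON request body per chunk of (query, params) pairs."""
    for chunk in chunked(statements, exec_size):
        body = {
            "statements": [
                {"statement": query, "parameters": params}
                for query, params in chunk
            ]
        }
        yield json.dumps(body)


query = "CREATE (n:A {uid: {uid}, attr: {attr}})"
statements = [(query, {"uid": uid, "attr": 5}) for uid in range(1, 501)]

# 500 statements at 200 per request give bodies of 200, 200 and 100 statements.
payloads = list(transaction_payloads(statements, exec_size=200))
print(len(payloads), "request bodies")
```

Committing after every few thousand statements, rather than after every request, would correspond to the transaction sizes mentioned above.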

Eve Freeman answered Nov 14 '22 22:11