How to create billions of nodes in Neo4j?

I want to test Neo4j performance with a large number of nodes. I am thinking of creating billions of nodes and then seeing how long it takes to fetch a node that meets some criteria, e.g. 1 billion nodes labeled Person, each with an SSN property:

match (p:Person) where p.SSN=4255556656425 return p;

But how can I create 1 billion nodes? Is there a way to generate them?

Asked Dec 04 '22 by Mahtab Alam

2 Answers

What you would be measuring then is the performance of the Lucene index, not a graph-database operation.
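For that lookup to be an index operation at all, a schema index on :Person(SSN) has to exist first. A minimal sketch (Neo4j 2.x syntax, property name taken from the question):

// create the schema index the lookup will rely on
create index on :Person(SSN);

// wait until the index is online (neo4j-shell command); the question's
// query is then an index seek instead of a scan over all Person nodes
schema await
match (p:Person) where p.SSN = 4255556656425 return p;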

There are a number of options:

neo4j-import

Neo4j 2.2.0-M03 comes with neo4j-import, a tool that can quickly and scalably import a CSV with a billion nodes into Neo4j.

parallel-batch-importer API

This API is very new in Neo4j 2.2.

I created a node-only graph with 1,000,000,000 nodes in 5 min 13 s (53 GB db) with the new ParallelBatchImporter, which works out to about 3.2M nodes/second.

Code is here: https://gist.github.com/jexp/0ff850ab2ce41c9ca5e6

batch-inserter

You could use the Neo4j Batch-Inserter-API to create that data without creating the CSV first.

See this example, which you would have to adapt so that it does not read a CSV but generates the data directly in a loop: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/

Cypher

If you want to use Cypher, I'd recommend running something like the following in the neo4j-shell, started with JAVA_OPTS="-Xmx4G -Xms4G" bin/neo4j-shell -path billion.db:

Here are the code and the timings I took for 10M and 100M nodes on my MacBook (and for 1B nodes on a Linux server):

create a CSV file with 1M lines:

ruby -e 'File.open("million.csv","w"){|f| (1..1000000).each{|i| f.write(i.to_s + "\n")}}'
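If you'd rather skip the intermediate CSV, a rough pure-Cypher alternative is to unwind a range. Note this runs as a single transaction (periodic commit only works with LOAD CSV), so it is only practical for a few million nodes per statement:

// create 1M Person nodes in one transaction -- needs enough heap
unwind range(1,1000000) as i
create (:Person {id: i});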

Notes on the experiment:

- run on a MacBook Pro
- Cypher execution is single-threaded
- estimated size: (15 + 42) bytes * node count, i.e. roughly 57 GB for 1B nodes

// on my laptop
// 10M nodes, 1 property, 1 label each in 98228 ms (98s) taking 580 MB on disk

using periodic commit 10000
load csv from "file:million.csv" as row
//with row limit 5000
foreach (x in range(0,9) | create (:Person {id:toInt(row[0])*10+x}));

// on my laptop
// 100M nodes, 1 property, 1 label each in 1684411 ms (28 mins) taking 6 GB on disk

using periodic commit 1000
load csv from "file:million.csv" as row
foreach (x in range(0,99) | create (:Person {id:toInt(row[0])*100+x}));

// on my linux server
// 1B nodes, 1 property, 1 label each in 10588883 ms (176 min) taking 63 GB on disk

using periodic commit 1000
load csv from "file:million.csv" as row
foreach (x in range(0,999) | create (:Person {id:toInt(row[0])*1000+x}));

creating indexes

create index on :Person(id);
schema await

// took about 40 mins and increased the database size to 85 GB

then I can run

match (:Person {id:8005300}) return count(*);
+----------+
| count(*) |
+----------+
| 1        |
+----------+
1 row
2 ms
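To double-check that this lookup really uses the index rather than scanning the :Person label, you can prefix the query with PROFILE; with the :Person(id) index online the plan should show an index seek (NodeIndexSeek) rather than a label scan:

profile match (:Person {id:8005300}) return count(*);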
Answered Dec 30 '22 by Michael Hunger


The other answer is a good one. If you want something a bit more involved, Michael Hunger posted a good blog entry on this. He recommends something basically very similar, but you can also loop over some sample data and use random numbers to establish linkages.

Here's how he created 100,000 users and a small set of products, then linked them; customize as you see fit:

WITH ["Andres","Wes","Rik","Mark","Peter","Kenny","Michael","Stefan","Max","Chris"] AS names
FOREACH (r IN range(0,100000) | CREATE (:User {id:r, name:names[r % size(names)]+" "+r}));

with ["Mac","iPhone","Das Keyboard","Kymera Wand","HyperJuice Battery",
"Peachy Printer","HexaAirBot",
"AR-Drone","Sonic Screwdriver",
"Zentable","PowerUp"] as names
    foreach (r in range(0,50) | create (:Product {id:r, name:names[r % size(names)]+" "+r}));

Let's not forget sweet random linkage:

match (u:User),(p:Product)
where rand() < 0.1
with u,p
limit 50000
merge (u)-[:OWN]->(p);
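A quick sanity check after the linkage (the relationship count will vary because of rand()):

// how many users, products and :OWN relationships ended up in the graph
match (u:User) return count(u);
match (p:Product) return count(p);
match ()-[:OWN]->() return count(*);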

Go nuts.

Answered Dec 30 '22 by FrobberOfBits