
Best way to get (millions of rows of) data into Janusgraph via Tinkerpop, with a specific model

Just started out with Tinkerpop and Janusgraph, and I'm trying to figure this out based on the documentation.

  • I have three datasets, each containing about 20 million rows (CSV files)
  • There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
  • After having everything in a graph, I'd like to of course use some basic Gremlin to see how well the model works.

But first I need a way to get the data into Janusgraph.

Perhaps scripts for this already exist. Otherwise, is it something to write in Python: open a CSV file, read each row of a variable X, and add it as a vertex, edge, etc.? Or am I completely misinterpreting JanusGraph/TinkerPop?

Thanks for any help in advance.

EDIT:

Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:

             metric_1    metric_2    metric_3    ..

person_1        a           e           i
person_2        b           f           j
person_3        c           g           k
person_4        d           h           l
..        

Should I translate this into files of nodes that at first consist of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?

And are [a,..., l] then indexed?

The 'Modern' graph here seems to have an index (the numbers 1, ..., 12 for all the nodes and edges, independent of their label/category). Should each measurement be indexed separately and then linked to the person_x to which it belongs?

Apologies for these probably straightforward questions, but I'm fairly new to this.

nikolai asked Nov 13 '18 20:11


2 Answers

Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version about 2 years ago, and it's still a pain to bulk load data. A lot of that is not necessarily down to JanusGraph. Different users have very different data, different formats, and different graph models (i.e. some mostly need one vertex with one edge, e.g. child-mother, while others deal with one vertex with many edges, e.g. user followers). Last but definitely not least, the very nature of the tool is to deal with large data sets, and the underlying storage and index databases mostly come preconfigured to replicate massively (i.e. you might be thinking 20m rows, but you actually end up inserting 60m or 80m entries).

All that said, I've had moderate success bulk loading some tens of millions of entries in decent timeframes. Again, it will be painful, but here are the general steps:

  • Provide IDs when creating graph elements. If importing from e.g. MySQL, consider combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
  • Don't specify a schema up front, because JanusGraph would then need to check that the data conforms on each insert.
  • Don't specify indexes up front. This is related to the point above but really deserves its own entry: bulk insert first, index later.
  • Please, please, please be aware of the underlying database's features for bulk inserts and activate them, i.e. read the Cassandra, ScyllaDB, or Bigtable docs, especially on replication and indexing.
  • After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate IDs), and consider some form of parallelizing insert requests, e.g. some kind of MapReduce system.
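
The ID and integrity points above can be checked before anything touches the database. Below is a minimal Python sketch, assuming the source rows are available as (table name, primary key) pairs. The helper names `make_element_id` and `find_duplicate_ids` are made up for illustration and are not part of JanusGraph or TinkerPop.

```python
from collections import Counter


def make_element_id(table_name, row_id):
    """Combine a source table name and its row id into one graph-wide id."""
    return f"{table_name}{row_id}"


def find_duplicate_ids(ids):
    """Return the ids that occur more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical rows from two source tables: (table name, primary key).
rows = [("users", 1), ("tweets", 2), ("users", 2), ("users", 1)]
ids = [make_element_id(table, pk) for table, pk in rows]
duplicates = find_duplicate_ids(ids)

print(ids)         # ['users1', 'tweets2', 'users2', 'users1']
print(duplicates)  # ['users1']
```

Whether the backend accepts IDs of this shape depends on the JanusGraph version and configuration, so treat the string format here as an assumption to verify.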

I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error. Take the bulk insert rate, for example: too low (e.g. 10 per second) is bad, while too high (e.g. 10k per second) is equally bad. It almost always depends on your data, so it's a case-by-case basis, and I can't recommend where you should start.
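
One way to make the insert rate tunable is to send rows in fixed-size batches with an optional pause between them. The Python sketch below shows only that pattern. `insert_batch` is a hypothetical callable standing in for whatever actually writes to the graph, and `batch_size` and `pause_seconds` are illustrative knobs, not JanusGraph settings.

```python
import time
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(rows, insert_batch, batch_size, pause_seconds=0.0):
    """Pass rows to insert_batch one batch at a time, pausing in between.

    insert_batch is a hypothetical stand-in for the real write call.
    Returns the total number of rows handed over.
    """
    total = 0
    for batch in batched(rows, batch_size):
        insert_batch(batch)
        total += len(batch)
        if pause_seconds > 0:
            time.sleep(pause_seconds)
    return total


# Demo with a list's append method in place of a real database write.
received = []
count = load_in_batches(range(10), received.append, batch_size=4)
print(count, [len(b) for b in received])  # 10 [4, 4, 2]
```

Starting with a small batch size and raising it while watching the backend for timeouts is one way to run the trial and error described above.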

All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.

All the best!

Don Omondi answered Sep 28 '22 20:09


JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):

cd /path/to/janus
bin/janusgraph.sh start

Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console:

bin/gremlin.sh -e scripts/load_data.script 

An efficient way to load the data is to split it into two files:

  • nodes.csv: one line per node with all its attributes
  • links.csv: one line per link with source_id, target_id, and all the link's attributes

This might require some data preparation steps.
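
As an illustration of such a preparation step, here is a Python sketch that turns a wide person/metric table like the one in the question into node and link rows. The model it uses (one node per person, one node per distinct metric value, one link per table cell) is only an assumption made for the example, as are the column names and the `metric:value` ID format. In-memory buffers stand in for the real nodes.csv and links.csv files.

```python
import csv
import io

# Hypothetical input in the wide layout from the question.
wide_csv = """person,metric_1,metric_2
person_1,a,e
person_2,b,f
"""


def split_wide_table(source, nodes_out, links_out):
    """Write one node row per person and per value, one link row per cell."""
    reader = csv.DictReader(source)
    nodes = csv.writer(nodes_out, lineterminator="\n")
    links = csv.writer(links_out, lineterminator="\n")
    nodes.writerow(["id", "label", "value"])
    links.writerow(["source_id", "target_id", "label"])
    seen = set()  # node ids already written, to avoid duplicate nodes
    id_column = reader.fieldnames[0]
    for row in reader:
        person_id = row[id_column]
        if person_id not in seen:
            seen.add(person_id)
            nodes.writerow([person_id, "person", person_id])
        for metric, value in row.items():
            if metric == id_column:
                continue
            value_id = f"{metric}:{value}"
            if value_id not in seen:
                seen.add(value_id)
                nodes.writerow([value_id, "measurement", value])
            links.writerow([person_id, value_id, metric])


nodes_file, links_file = io.StringIO(), io.StringIO()
split_wide_table(io.StringIO(wide_csv), nodes_file, links_file)
print(nodes_file.getvalue())
print(links_file.getvalue())
```

With 20 million rows per file, the `seen` set may grow large. Deduplicating with an external sort would be an alternative, at the cost of an extra pass.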

Here is an example script

The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
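
The idea behind that mapping can be shown without a database. In the Python sketch below, `FakeGraph` is an in-memory stand-in invented for this example, not the JanusGraph or TinkerPop API. It only shows why keeping a dictionary from your own IDs to the generated IDs lets each link be created by direct lookup instead of a search.

```python
import itertools


class FakeGraph:
    """In-memory stand-in for a graph database; NOT the JanusGraph API."""

    def __init__(self):
        self._next_id = itertools.count(4096, 8)  # arbitrary generated ids
        self.vertices = {}
        self.edges = []

    def add_vertex(self, properties):
        """Store the vertex and return the id the 'database' assigned."""
        internal_id = next(self._next_id)
        self.vertices[internal_id] = properties
        return internal_id

    def add_edge(self, source_internal_id, target_internal_id, label):
        """Connect two vertices that are addressed by their internal ids."""
        if source_internal_id not in self.vertices:
            raise KeyError(source_internal_id)
        if target_internal_id not in self.vertices:
            raise KeyError(target_internal_id)
        self.edges.append((source_internal_id, target_internal_id, label))


# Hypothetical rows as they might come out of nodes.csv and links.csv.
node_rows = [{"id": "person_1"}, {"id": "metric_1:a"}]
link_rows = [
    {"source_id": "person_1", "target_id": "metric_1:a", "label": "metric_1"},
]

graph = FakeGraph()
id_map = {}  # your id -> id generated by the graph
for row in node_rows:
    id_map[row["id"]] = graph.add_vertex(row)

for row in link_rows:
    graph.add_edge(id_map[row["source_id"]], id_map[row["target_id"]], row["label"])

print(id_map)       # {'person_1': 4096, 'metric_1:a': 4104}
print(graph.edges)  # [(4096, 4104, 'metric_1')]
```

For tens of millions of nodes the dictionary must fit in memory, which is worth checking before committing to this approach.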

Even if it is not mandatory, I strongly recommend that you create an explicit schema for your graph before loading any data. Here is an example script

Benoit Guigal answered Sep 28 '22 20:09