We are loading data into Neptune using Gremlin, on a Neptune DB instance of size db.r5.4xlarge (16 vCPUs). Data is loaded via an AWS Glue job with 5 workers using PySpark.
We load by upserting a deduped dataset, batching records (50 records/batch) together as a single query to Neptune.
Vertices: compute all vertices to be loaded into the graph after deduping (there are no duplicate vertices).
Query used:
g.V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
 .V().has(T.id, record.id).fold().coalesce(__.unfold(), __.addV(record.source).property(T.id, record.id))
 (chained the same way for the remaining 48 records in the batch)
 .next()
Time taken for 2.45M unique vertices: 5 minutes.
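For context, the 50-record batching described above can be sketched as a simple chunking helper. This is an illustrative sketch, not the original Glue job code; the name `chunked` and the use of plain integers as stand-ins for vertex records are assumptions.

```python
# Hypothetical sketch of how 50-record batches might be formed before each
# chained upsert traversal is built and submitted to Neptune.
from itertools import islice

def chunked(records, batch_size=50):
    """Yield successive batches of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 103 records split into batches of 50, 50, and 3.
batches = list(chunked(range(103), 50))
```

Each batch would then be turned into one chained `g.V()...coalesce(...)` traversal as shown in the query above, so that 2.45M vertices become roughly 49,000 round trips instead of 2.45M.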
Edges: compute all edges to be loaded into the graph after deduping (there are no duplicate edges).
Query used:
g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single, 'timestamp', edgeData.timestamp).property(Cardinality.single, 'count', edgeData.count)
 .V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2))).property(Cardinality.single, 'timestamp', edgeData.timestamp).property(Cardinality.single, 'count', edgeData.count)
 (chained the same way for the remaining 48 records in the batch)
 .next()
Time taken for 1.88M unique edges with properties: 21 minutes.
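The existence check in the query above uses `bothE()`/`otherV()`, i.e. it treats the edge as undirected, so the upstream dedupe step should also treat (id1, id2) as an unordered pair. A minimal sketch of that dedupe, assuming edges are dicts with `id1`/`id2` keys (the function name `dedupe_edges` is illustrative, not from the original job):

```python
def dedupe_edges(edges):
    """Collapse duplicate edges, treating (id1, id2) as an unordered pair
    to match the bothE()/otherV() existence check in the Gremlin query.
    Keeps the first occurrence of each pair."""
    seen = {}
    for e in edges:
        key = frozenset((e["id1"], e["id2"]))  # (a, b) == (b, a)
        seen.setdefault(key, e)
    return list(seen.values())

# Example: (a, b) and (b, a) collapse to one edge.
sample = [
    {"id1": "a", "id2": "b", "timestamp": 1, "count": 2},
    {"id1": "b", "id2": "a", "timestamp": 1, "count": 2},
    {"id1": "a", "id2": "c", "timestamp": 3, "count": 1},
]
deduped = dedupe_edges(sample)
```

If reversed pairs are not deduped upstream, the query's `bothE()` check still prevents duplicate edges in the graph, but each reversed pair costs an extra round trip.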
If we perform edge creation alone, without any edge properties:
Query used:
g.V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
 .V(edgeData.id1).bothE().where(__.otherV().hasId(edgeData.id2)).fold().coalesce(__.unfold(), __.addE('coincided_with').from_(__.V(edgeData.id1)).to(__.V(edgeData.id2)))
 (chained the same way for the remaining 48 records in the batch)
 .next()
Time taken for 1.88M unique edges without properties: 4 minutes.
Performance issue: adding the two edge properties takes the edge load from 4 minutes to 21 minutes.
Any suggestions to improve performance would be much appreciated.
With this many vertices and edges it might be worth using the bulk loader instead, where you create CSV files and import them from S3: https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format.html
Tip: put the curl commands for the loader into a SageMaker notebook so you can run them from there.
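As a rough sketch, kicking off a bulk load is a single POST to the Neptune loader endpoint. The endpoint, S3 bucket, and IAM role ARN below are placeholders you would substitute with your own values:

```shell
# Hypothetical example of starting a Neptune bulk load from CSV files in S3.
# The role must grant Neptune read access to the bucket.
curl -X POST \
    -H 'Content-Type: application/json' \
    https://your-neptune-endpoint:8182/loader -d '
    {
      "source": "s3://your-bucket/neptune-load/",
      "format": "csv",
      "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
      "region": "us-east-1",
      "failOnError": "FALSE",
      "parallelism": "OVERSUBSCRIBE"
    }'
```

The response includes a load ID you can poll (GET on the same /loader endpoint with that ID) to check progress.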