Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

olap and oltp queries in gremlin

In gremlin,

  1. s = graph.traversal()

  2. g = graph.traversal(computer())

i know the first one is for OLTP and second for OLAP. I know the difference between OLAP and OLTP at definition level.I have the following queries on this:

How does

  1. the above queries differ in working?
  2. Can I use second one ,using'g' in queries in my application to get results(I know this 'g' one gives gives results faster than first one )?
  3. Difference between OLAP and OLTP with example ?

Thanks in advance.

like image 546
Ahi Avatar asked Dec 24 '22 00:12

Ahi


1 Answers

From the user's perspective, in terms of results, there's no real difference between OLAP and OLTP. The Gremlin statements are the same save for configuration of the TraversalSource as you have shown with your use of withComputer() and other settings.

The difference is more in how the traversal is executed behind the scenes. OLAP-based traversals are meant to process the "entire graph" (i.e. all vertices/edges and perhaps more than once). Where OLTP based traversals are meant to process smaller bodies of data, typically starting with one or a handful of vertices and traversing from there. When you consider graphs in the scale of "billions of edges", it's easy to understand why an efficient mechanism like OLAP is needed to process such graphs.

You really shouldn't think of OLTP vs OLAP as "faster" vs "slower". It's probably better to think of it as it is described in the documentation:

  • OLTP: real-time, limited data accessed, random data access, sequential processing, querying
  • OLAP: long running, entire data set accessed, sequential data access, parallel processing, batch processing

There's no reason why you can't use an OLAP traversal in your applications so long as your application is aware of the requirements of that traversal. If you have some SLA that says that REST requests must complete in under 0.5 seconds and you decide to use an OLAP traversal to get the answer, you will undoubtedly break your SLA. Assuming you execute the OLAP traversal job over Spark, it will take Spark 10-15 seconds just to get organized to run your job.

I'm not sure how to provide an example of OLAP and OLTP, except to talk about the use cases a little bit more, so it should be clear as to when to use one as opposed to the other. In any case, let's assume you have a graph with 10 billion edges. You would want your OLTP traversals to always start with some form of index lookup - like a traversal that shows the average age of the friends of the user "stephenm":

g.V().has('username','stephenm').out('knows').values('age').mean()

but what if I want to know the average age of every user in my database? In this case I don't have any index I can use to lookup a "small set of starting vertices" - I have to process all the many millions/billions of vertices in my graph. This is a perfect use case for OLAP:

g.V().hasLabel('user').values('age').mean()

OLAP is also great for understanding growth of your graph and for maintaining your graph. With billions of edges and a high data ingestion rate, not knowing that your graph is growing improperly is a death sentence. It's good to use OLAP to grab global statistics over all the data in the graph:

g.E().label().groupCount()
g.V().label().groupCount()

In the above examples, you get an edge/vertex label distribution. If you have an idea as to how your graph is growing, this can be a good indicator of whether or not your data ingestion process is working properly. On a billion edge graph, trying to execute even one of the traversals would take "forever" if it ever finished at all without error.

like image 57
stephen mallette Avatar answered Jan 15 '23 03:01

stephen mallette