
triplestores vs native graph dbs on fast queries

I'm researching native graph databases and triple stores (RDF stores) for our use. We're currently focused on MarkLogic for the triple store, and on Neo4j and maybe OrientDB for the native graph DB.

Part A of this question lays out the context: I'm investigating a major distinction between these two types of DBs, and I'd like verification of whether I'm missing anything in this picture.

In Part B, I'm asking how much of what I outline in Part A each of these DBs actually does.

Part A:

AFAIK so far, a major distinction is that triple stores store relationships, or rather edges, as the primary unit. So it's a "bag" of edges, each with specific, well-designed attributes that reflect the semantics of that relationship. Native graph DBs, on the other hand, store the graph structure itself: nodes and the links between them, along with whatever attributes you'd like to define on those nodes and links.

I think the following two descriptions set out two extremes that give a fair view of the two approaches. I'm pretty sure the DBs out there do more than either of these extremes.

1.) bag of edges (triple store): overall, each subject-predicate-object triple, say (sourceNode, edge, destNode), is stored as a single record, forming a triple store entry. The triple store is indexed on each of these 3 columns, so when I need a list of people who have friends that live in Australia, I (or rather, the triple store engine) quickly get the "friends" relationships and, among them, search for the ones whose source or dest node is a person and has the property "lives in Australia".

2.) native graph: nodes with labels and properties, and the links in between. In order to find people "who have friends that live in Australia", I first find the nodes that are labeled "person", then I search the relationship list (which is a linked list (?)) of each such node, and go from there. This is 2 searches, one on the nodes and a second on the relationships of each node, as opposed to the single search on the relationships (triples) in a triple store.
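The two extremes above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the names `FACTS`, `pos_index`, `nodes`, `triple_store_query` and `native_graph_query` are made up for illustration, and no real engine (MarkLogic, Neo4j, OrientDB) is claimed to work exactly this way.

```python
from collections import defaultdict

# Shared toy data: (subject, predicate, object) facts. All names are hypothetical.
FACTS = [
    ("alice", "type", "Person"), ("bob", "type", "Person"),
    ("carol", "type", "Person"), ("dave", "type", "Person"),
    ("alice", "friendOf", "bob"), ("carol", "friendOf", "dave"),
    ("bob", "livesIn", "Australia"), ("dave", "livesIn", "Canada"),
]

# Extreme 1: bag of edges, with an index from (predicate, object) to subjects.
pos_index = defaultdict(set)
for s, p, o in FACTS:
    pos_index[(p, o)].add(s)

def triple_store_query(country):
    """People who have a friend living in `country`, using index lookups only."""
    residents = pos_index.get(("livesIn", country), set())
    people = pos_index.get(("type", "Person"), set())
    hits = set()
    for resident in residents:
        hits |= pos_index.get(("friendOf", resident), set())
    return hits & people

# Extreme 2: each node owns its labels, properties and its own relationship list.
nodes = defaultdict(lambda: {"labels": set(), "props": {}, "rels": []})
for s, p, o in FACTS:
    if p == "type":
        nodes[s]["labels"].add(o)
    elif p == "friendOf":
        nodes[s]["rels"].append(("FRIEND", o))
    else:
        nodes[s]["props"][p] = o

def native_graph_query(country):
    """Same question, answered by scanning Person nodes and walking their links."""
    hits = set()
    for name, node in list(nodes.items()):
        if "Person" not in node["labels"]:
            continue
        for rel_type, other in node["rels"]:
            if rel_type == "FRIEND" and nodes[other]["props"].get("livesIn") == country:
                hits.add(name)
    return hits

print(triple_store_query("Australia"))
print(native_graph_query("Australia"))
```

Both functions return the same answer; the difference is only in whether the work starts from an index over edges or from the nodes and their relationship lists.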

One thing I keep seeing in blogs on the pros and cons of triple stores vs native graph DBs is that triple stores score on queries because of their indexing: the relationships can be accessed quickly. In a native graph DB, relationships are accessed through the nodes they are incident to. (I'm aware that, by the same token, native graph DBs have the advantage of retaining the graph structure, so that graph algorithms can be implemented more easily and run faster.)

However, the lack of indexing does not necessarily have to be a shortcoming of a native graph DB if it allows indexing of nodes and/or relationships based on their properties and/or their labels.

  • If it allows labeling of nodes and indexes on those labels, I as the developer can select a subgraph of the overall graph and go from there. A query on such a restricted domain would be much faster.

  • If it allows labeling of relationships, queries "revolving around" relationships, like the "list of people who have friends that live in Australia" above, can execute faster, because the query won't traverse links from the nodes and look up node properties, but will instead look up and access the links directly.

Part B:

How much of this are MarkLogic, Neo4j and OrientDB doing?

I skimmed through Chapter 6 of this book on Neo4j and haven't seen anything about a direct search on an index of edges (relationships). Have I missed anything?

If I did miss it and Neo4j does have such indexing on edges, why are triple stores said to have a major advantage over native graph DBs on fast queries?

TIA.

//----------------------

EDIT:

Note: I've seen Graph DBs vs. Document DBs vs. Triplestores among some other useful discussions.

user6401178 asked Jun 01 '16 20:06


1 Answer

For Part A: the differences between a triple store and a graph store aren't so much in what is stored as in how they are intended to be queried.

A graph store aims to answer graph queries: questions about the structure of the graph. This includes the minimum distance between two points (e.g. route planning), perhaps with conditional evaluation (e.g. avoid motorways/highways, or I'm driving a caravan at a limit of 50mph), perhaps also returning a calculated value (e.g. distance/time taken, best route steps). It could also include finding similar subgraphs, and various other graph-type queries.
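As a rough illustration of such a structural query, here is a minimal Dijkstra sketch in Python with a conditional filter on road type. The `roads` data, the `shortest_route` function and its `avoid` parameter are assumptions made up for this example; a real graph store would expose this through its own query language or procedures.

```python
import heapq

def shortest_route(graph, start, goal, avoid=frozenset()):
    """Return (total_km, path) for the shortest route, or None if unreachable.

    `graph` maps a place to a list of (neighbour, km, road_type) edges.
    Edges whose road_type is in `avoid` are skipped (the conditional part).
    """
    queue = [(0, start, [start])]
    settled = {}
    while queue:
        dist, place, path = heapq.heappop(queue)
        if place == goal:
            return dist, path
        if place in settled and settled[place] <= dist:
            continue
        settled[place] = dist
        for neighbour, km, road_type in graph.get(place, []):
            if road_type in avoid:
                continue
            heapq.heappush(queue, (dist + km, neighbour, path + [neighbour]))
    return None

# Hypothetical road network.
roads = {
    "A": [("B", 10, "motorway"), ("C", 7, "local")],
    "B": [("D", 5, "motorway")],
    "C": [("D", 12, "local")],
    "D": [],
}

print(shortest_route(roads, "A", "D"))                      # (15, ['A', 'B', 'D'])
print(shortest_route(roads, "A", "D", avoid={"motorway"}))  # (19, ['A', 'C', 'D'])
```

The query needs to walk the graph, and its answer (a distance plus the route steps) is computed from the structure rather than read from any single record.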

A triple store aims to return information about a matching subject or subjects. E.g. "Find me all people who know other people that are members of an organisation of type Drug Gang, and return their personal profile information". In this query the bounds of the network you are querying are known (person -> person -> organisation -> org type), and you are returning a set of information (all 'person' assertions). This is a triple query.
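The bounded pattern in that example (person -> person -> organisation -> org type) can be sketched as a chain of triple-pattern lookups. This is an illustrative Python toy, not SPARQL and not any product's API: the predicate names (`knows`, `memberOf`, `orgType`), the data and the `match` helper are all assumptions, and a real engine would use indexes instead of list scans.

```python
# Hypothetical assertions as (subject, predicate, object).
triples = [
    ("p1", "knows", "p2"),
    ("p2", "memberOf", "org1"),
    ("org1", "orgType", "DrugGang"),
    ("p1", "name", "Ann"),
    ("p1", "age", "34"),
    ("p3", "knows", "p4"),
    ("p4", "memberOf", "org2"),
    ("org2", "orgType", "Charity"),
    ("p3", "name", "Cy"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Work backwards along the known, fixed-length pattern.
gangs = {s for s, _, _ in match(p="orgType", o="DrugGang")}
members = {s for s, _, o in match(p="memberOf") if o in gangs}
acquaintances = {s for s, _, o in match(p="knows") if o in members}

# Return every assertion about each matching person (their "profile").
profiles = {person: match(s=person) for person in acquaintances}
print(profiles)
```

The depth of the pattern is fixed up front, and the result is a set of assertions about the matching subjects rather than a path or a computed value.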

Because of the nature of the above two query types, you see very different physical architectures. Neo4j and most graph stores adopt an 'all information on each node' approach, where 'node' here means a server: multiple servers are used to scale up query load, and each additional server holds a 100% copy of the data.

Triple stores, on the other hand (pure plays, or hybrid NoSQL databases like MarkLogic and OrientDB), are architected to split the data into partitions/shards across multiple servers. This allows for linear scalability on commodity hardware, rather than a large amount of data requiring a large piece of tin. The downside, of course, is that if the data needed lies across multiple servers, you get a local network hit to complete a complex 'graph style' query.
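To make the cross-server cost concrete, here is a toy Python sketch that places triples on shards by hashing the subject and records which shard each hop of a traversal has to consult. Hashing by subject, `NUM_SHARDS`, `shard_for` and `follow` are all assumptions for illustration; actual products choose their own partitioning schemes.

```python
import zlib

NUM_SHARDS = 3  # hypothetical cluster size

def shard_for(subject):
    """Pick a shard from a stable hash of the subject (an assumed scheme)."""
    return zlib.crc32(subject.encode("utf-8")) % NUM_SHARDS

shards = [[] for _ in range(NUM_SHARDS)]

def add(s, p, o):
    shards[shard_for(s)].append((s, p, o))

def follow(start, predicates):
    """Follow a chain of predicates from `start`.

    Returns (end_nodes, visited), where `visited` lists the shard consulted
    for every subject looked up along the way.
    """
    frontier = {start}
    visited = []
    for predicate in predicates:
        next_frontier = set()
        for subject in frontier:
            idx = shard_for(subject)
            visited.append(idx)
            next_frontier |= {
                o for s, p, o in shards[idx] if s == subject and p == predicate
            }
        frontier = next_frontier
    return frontier, visited

add("p1", "knows", "p2")
add("p2", "memberOf", "org1")

ends, visited = follow("p1", ["knows", "memberOf"])
print(ends, visited)
```

A lookup on one subject touches one shard, while every extra hop is another shard lookup that may land on a different server, which is where the network hit for 'graph style' queries comes from.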

This isn't to say graph stores cannot store triples (they do) or that triple stores cannot carry out graph queries (they can, you just have to construct it yourself) - but they're PRIMARILY built for different query types.

I have a Query Console example of graph queries across large datasets in MarkLogic's triple store, for example, that run in a few seconds rather than the usual milliseconds for 'normal' triple store queries.

There are open standards around triple stores, led by the World Wide Web Consortium (W3C). These standards include RDF and SPARQL, and associated standards. Using open standards of course avoids vendor lock-in to one product. MarkLogic Server and AllegroGraph are both open standards compliant in this regard.

The downside of the W3C standards is that RDF has no concept of 'assertions about a relationship', i.e. it does not allow the storing of properties on the relationships themselves. Some graph stores like Neo4j do allow this. You can model this by having a relationship be a type of 'thing' in a triple store, but this isn't as nice a mental model to work in.
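A small Python sketch of that workaround: the relationship is promoted to its own subject so that properties can hang off it. The predicate names (`type`, `from`, `to`, `since`), the `friendship1` identifier and the `relationship_properties` helper are hypothetical and chosen only for this example.

```python
# The plain edge, plus the same relationship modelled as a 'thing'.
triples = [
    ("alice", "friendOf", "bob"),
    ("friendship1", "type", "Friendship"),
    ("friendship1", "from", "alice"),
    ("friendship1", "to", "bob"),
    ("friendship1", "since", "2010"),
]

def relationship_properties(triples, rel_type, src, dst):
    """Return the extra properties stored on the relationship resource."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {})[p] = o
    for attrs in by_subject.values():
        if (attrs.get("type") == rel_type
                and attrs.get("from") == src
                and attrs.get("to") == dst):
            # Everything except the structural predicates is a property.
            return {p: o for p, o in attrs.items() if p not in ("type", "from", "to")}
    return {}

print(relationship_properties(triples, "Friendship", "alice", "bob"))
```

One edge with a property becomes four triples about an intermediate resource, which is why this feels heavier than setting a property directly on a relationship.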

Where you have both documents and triples, a hybrid NoSQL database that natively supports indexing and querying of both is useful. MarkLogic Server and OrientDB both provide this. MarkLogic Server allows you to execute a structural (has/doesn't have element), field (exact match), range (less than, greater than), geospatial (point within an area, e.g. arbitrary polygon), bi-temporal (need more room to explain...) and semantic query in one hit against the same record. If you need something to cover both, you may want to look there.

At the risk of plugging my own work, I have published two books on the subject - NoSQL for Dummies (retail, 400 page version), and The State of NoSQL 2016 (kindle only) that will give you all the background you need. I've also blogged about related subjects on https://adamfowler.org/blog/ . Hope this helps.

adamfowleruk answered Oct 04 '22 14:10