
Graph DBs vs. Document DBs vs. Triplestores

This is a somewhat abstract and general question. I'm interested in the inherent (as well as implementation-specific) properties of different approaches to persist unstructured data with both lots of internal references (graph-like) and lots of properties (JSON-like).

  • Since a graph is a superset of a tree, you can look at graph DBs (e.g. Neo4j) as a superset of document DBs (e.g. MongoDB). That is, a graph DB provides all the functionality of a document DB and additionally allows cycles and has a native pointer type, so you don't have to dereference foreign keys/ids manually (see the sketch after this list). So is there a tipping point, as you add more references between your objects/resources, beyond which you're better off with a graph DB where previously a document store was the better fit? Are there advantages to document DBs (storage space, performance?), or should you always go with a graph DB just in case you'll need more references in the future?

  • Similarly, how do graph DBs and triplestores (e.g. RDF stores) compare? Graph DBs (where nodes and edges have properties) seem to be a superset of plain triplestores. So for what problems (if any) do triplestores actually perform better than, say, Neo4j? (One advantage of RDF stores is that there is a standardized query language – SPARQL – although there seem to be a lot of people who don't like SPARQL and would therefore call it a disadvantage.)
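
To make the first bullet concrete, here is a minimal sketch of the difference (plain Python, no particular database; the collection layout and names are invented for illustration): in a document model, related records are referenced by id and the application has to chase those ids itself, whereas a graph-style model holds direct references that a traversal simply follows.

```python
# Document-style: related records are referenced by id and must be
# looked up ("joined") by the application.
users = {
    "u1": {"name": "Alice", "friend_ids": ["u2", "u3"]},
    "u2": {"name": "Bob",   "friend_ids": ["u3"]},
    "u3": {"name": "Carol", "friend_ids": []},
}

def friends_of_friends(user_id):
    """Two-hop traversal: every id has to be dereferenced by hand."""
    result = set()
    for fid in users[user_id]["friend_ids"]:
        for ffid in users[fid]["friend_ids"]:
            result.add(users[ffid]["name"])
    return result

print(friends_of_friends("u1"))  # {'Carol'}

# Graph-style: nodes hold direct references (pointers) to their
# neighbours, so a traversal just follows edges.
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []  # list of Node objects, not ids

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.friends = [bob, carol]
bob.friends = [carol]

print({ff.name for f in alice.friends for ff in f.friends})  # {'Carol'}
```

The tipping point asked about is roughly the point where this hand-written id chasing (and the extra lookups it implies) starts to dominate your access patterns.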

I guess my question is: the graph model (with properties) seems able to neatly express all kinds of data, so what is the catch when you enter reality? I suppose the catch of graph DBs is performance, so I'd love to see some numbers or rules of thumb on what kind of slowdowns to expect when loading, querying and modifying data, as well as on memory and persistent-storage requirements (compared to document and triple stores). And what about horizontal scalability? I got the impression that the playing field there is quite level.

Do you think it is possible that graphs, with their expressiveness, will become the new default storage model for projects that don't have super-large data, or are we doomed to a decade of Polyglot Persistence, with RDBMSs, JSON stores and graph DBs living alongside each other and having to be integrated with ever more glue code?

asked Aug 20 '12 by mb21


1 Answer

I'm not sure I would agree with the sentiment that a lot of people don't like SPARQL. SPARQL 1.0 did have some shortcomings, but it addressed what it was designed for quite nicely, and the new iteration, SPARQL 1.1, builds upon it, adding many constructs from SQL that people expected to see in the original spec, including sub-queries, aggregates and update semantics. I think the fact that it's a standard, so you can expect the same parsing and semantics in every triple store, as opposed to dialects of SQL, is a nice feature.
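
As an illustration of the 1.1 additions, here is a small aggregate query. This sketch uses the Python rdflib library purely as a convenient in-memory SPARQL engine (my choice of tooling, not something from the question), and the ex: vocabulary is invented; the same query text would work against any SPARQL 1.1-compliant store.

```python
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob, ex:carol .
ex:bob   ex:knows ex:carol .
ex:alice ex:name  "Alice" .
ex:bob   ex:name  "Bob" .
ex:carol ex:name  "Carol" .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# SPARQL 1.1 aggregate (not available in 1.0): how many people does
# each person know?
query = """
PREFIX ex: <http://example.org/>
SELECT ?name (COUNT(?friend) AS ?friends)
WHERE {
  ?person ex:knows ?friend ;
          ex:name  ?name .
}
GROUP BY ?name
"""

for name, friends in g.query(query):
    print(name, friends)  # Alice 2, Bob 1
```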

I would also claim that all triple stores are graph databases; you can put properties on specific edges in RDF, albeit not as nicely as you can with Neo4j. But triple stores have the advantages of a real query language, a W3C-standard data representation that makes it trivial to move your data to another triplestore, and, for a number of triple stores, the ability to perform reasoning based on OWL.
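
To sketch what "properties on specific edges" can look like in RDF, one common pattern is to promote the relationship itself to a resource and attach the properties to it (the vocabulary below is invented for illustration; in a property graph like Neo4j you would instead put the attribute directly on the relationship):

```python
from rdflib import Graph

# The friendship is a resource of its own, so it can carry properties.
ttl = """
@prefix ex: <http://example.org/> .

ex:rel1 a ex:Friendship ;
        ex:src   ex:alice ;
        ex:dst   ex:bob ;
        ex:since "2009" .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

q = """
PREFIX ex: <http://example.org/>
SELECT ?src ?dst ?since WHERE {
  ?rel a ex:Friendship ; ex:src ?src ; ex:dst ?dst ; ex:since ?since .
}
"""
for src, dst, since in g.query(q):
    print(src, dst, since)  # http://example.org/alice http://example.org/bob 2009
```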

I don't know much about the scalability of most graph DBs, but generally the commercial RDF databases scale quite well. All can scale into the billions of triples, which handles a great many use cases, though how they handle scale differs wildly from vendor to vendor with respect to scale-up vs. scale-out, clustering, etc. You'll also see pretty different memory and hardware requirements to match each implementation. For me, I've tended to just grab an EC2 instance, usually a 2XL or 4XL, mount an EBS volume large enough to hold the data, and I'm pretty well set.

Additionally, some triple stores integrate with Lucene or similar technologies to provide inverted (full-text) indexes over the data, and many are now starting to include geospatial and temporal indexes. These are very useful features, and I'm not sure whether something like Neo4j offers them.

With that said, they're not going to scale as well as relational databases; they're just not as mature. But you're also not going to get screwed when you have "real" amounts of data either. Of course, one of the advantages of triple stores is reasoning, which is tricky to perform at scale, but that's much of the reason the various OWL profiles were created. Still, you can paint yourself into a corner if you don't think ahead.
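
To show what reasoning buys you in practice, here is a minimal sketch assuming the Python libraries rdflib and owlrl (my choice of tooling, and only RDFS-level entailment rather than a full OWL profile; the vocabulary is invented): one subclass axiom plus one fact lets you query a statement that was never asserted.

```python
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.Dog, RDFS.subClassOf, EX.Animal))  # schema: every Dog is an Animal
g.add((EX.rex, RDF.type, EX.Dog))            # data: rex is a Dog

# Materialize the RDFS entailments (a simple form of reasoning).
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.rex, RDF.type, EX.Animal) in g)  # True, although never asserted
```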

I think graph databases, and triple stores specifically, can be a pretty good match for a lot of the applications being built, but I don't think that means everything should be done with them. Like anything else, they're tools with their good points and their bad points, so you have to make the right choice based on your application. But they probably always merit at least a consideration these days.

answered Sep 21 '22 by Michael