Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How is a graph database different to a graph represented in a relational database?

I can represent a graph trivially in a relational database with two tables: vertex and edge. Richer structure like "properties" and "labels" (in Neo4j terminology) can be represented as more tables. Have I misunderstood, or does a graph database like Neo4j allow me to represent anything that is not easily representable relationally?

I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. Have I misunderstood, or does a graph query language like Cypher provide greater expressivity than SQL?

The relational model of a graph is stored and queried efficiently, AFAIK. Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?

My relational database provides ACID guarantees, and allows me to write fairly expressive constraints on my graph data (and even more constraints if I break out the single vertex table into a properly normalized schema). Have I misunderstood, or does a graph database provide some guarantees or verify some kind of correctness properties that are not available in my relational database?

I am struggling to see how a graph database such as Neo4j is anything but a subset of the relational model. (Apologies for using Neo4j as representative of all graph databases here; it's the only one I've looked at.)

In short: Is graph database ⊆ relational database?

like image 707
jameshfisher Avatar asked Oct 10 '14 17:10

jameshfisher


1 Answers

Is One a Subset of the Other?

Definitely no; both are eventually modeled on the mathematical concepts of relations or graphs. Both models being super-general, there is basically no information content that you can't represent using either one. This means that while they might differ in many syntactic sugar ways, and in the way they encourage you to model/think of data (just like programming languages differ) they both have the same "expressive power".

What you describe in your question is one way of modeling a graph (vertex and edge tables). That implementation of a graph is a subset of what relational can express. Similarly, I could mock up tables and rows using a graph database, but I would have chosen a particular implementation - this wouldn't demonstrate that relational data is a subset of graph data.

So the first insight is that they have roughly equal expressive power. You can model anything in either. So the real question you should be asking is why would you choose one over the other?

Why Would you Choose One Over The Other?

All databases exist to facilitate data access. Simply put, you store it so that you can get at the data. But exactly how do you need to get at the data? There are many different access patterns. The design space for databases in general is enormous. Any time a database makes a certain decision, that tends to automatically make it better at some things, worse at others. For example, when you create an index in a relational database, you've just sped up reads -- but you've degraded the performance of writes, because the index has to be maintained.

So, when approaching the question, "Graph or Relational?" - you should first figure out what does your data look like, and what do your data access patterns look like. If you knew what those things were, then you could evaluate a bunch of databases, see the choices they've made, and pick the one that's a good fit for what you need. And then if a DBMS made a choice that would make certain access patterns difficult, buggy, or slow -- you could avoid that DBMS for that data set.

It's (Partly) About Data Access Patterns

Graph databases tend to be better than relational when the data being stored is a graph, when the data access pattern involves a lot of graph traversal, or both. (See this other answer I wrote for a more in-depth discussion of why this is). That link there also provides the answer to your specific question: "Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?"

You say: I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. -- So technically this is true, but let's take an example to see why relational might not be good enough. Say I have a graph (in RDBMS, a table of nodes, a table of edges, with a join key between them). Let's say I pick out one node, and I want to identify everything that is between 6 and 8 hops away from that node. Here's the cypher to do that:

match (myChosenNode {id: 'foo'})-[r:relationshipType*6..8]->(y) return y;

I really want to see you write that up as SQL. It's possible, but it's hard and complicated. And it will also perform like a dog, because of the sheer quantity of joining you'll be doing on non-trivial quantities of data.

ACID

OK now on the ACID guarantees, Neo4J provides transactions with ACID guarantees. The answer will be different for different graph databases though, particularly the ones implemented on top of Hadoop/HBase. YMMV there, so check the fine print with each database.

It is true that there are a number of features of RDBMS that you typically won't find in graph databases, examples being triggers and certain kinds of constraints. As a long-time RDMBS nerd myself, I'm not so happy about those things being missing, I think they are valuable.

Summary

What this mostly boils down to for me, and many other engineers I work with is:

  1. What is your data?
  2. What are your access patterns?

If your data is a graph, or your access patterns involve a lot of graph traversal, you should probably use a graph DB. If your data is more tabluar, or your access patterns are more oriented around bulk scans, then you should use RDBMS. At the end of the day, they're two different tools with different niches. If you use them in their area of strength, you'll be happy. If you use RDBMS to model a graph just "because you can", you'll suffer. If you use a graph database to do a lot of bulk scans of every node in every graph, you'll suffer. Like most of tech, it's just about using the right tool for the job.

like image 144
FrobberOfBits Avatar answered Oct 02 '22 07:10

FrobberOfBits