Neo4j how to model a time-versioned graph

Part of my graph has the following schema:

[image: graph schema]

The main part of the graph is the domain, which has some persons linked to it. Person has a unique constraint on the email property; I also have data from other sources, and email fits nicely as a shared key.

A person can be an admin, in which case he has some devices/calendars linked to him. I get this data from an SQL database, from which I import a few tables to build the whole picture. I start with a table that has two columns: the admin's email and his user id. This user id is specific to the production database and is not used globally by other sources, which is why I use email as the global ID for persons. I currently use the following query to import the user id, which all the production tables are linked to. I always get the current snapshot of the user settings and info. This query runs 4x/day:

CALL apoc.load.jdbc(url, import_query) yield row
MERGE (p:Person{email:row.email})
SET p.user_id = row.id

And then I import all the data that is linked to this user id from other tables.

The problem is that a user in the production db can change his email. With the way I am importing this right now, I will end up with two persons having the same user_id, and subsequently all the devices/calendars will be linked to both persons, as they both share the same user_id. This is not an accurate representation of reality. We also need to capture the connecting/disconnecting of devices to a particular user_id through time, as one can disconnect a device and loan it to a friend who has a different admin (user_id).

How should I change my graph model (and import query) so that:

  1. Querying who is currently the admin will not require complex queries
  2. Querying who currently has the device connected will not require complex queries
  3. Querying history can be a bit more complex.

Tomaž Bratanič asked Aug 14 '17 08:08

2 Answers

This answer is based on Ian Robinson's post about time-based versioned graphs.

I don't know if this answer covers ALL the requirements of the question, but I believe it can provide some insights.

Also, I'm assuming you are only interested in structural versioning (that is, you are not interested in queries about how, say, a domain user's name changed over time). Finally, I'm using a partial representation of your graph model, but I believe the concepts shown here can be applied to the whole graph.

The initial graph state:

Consider this Cypher, which creates an initial graph state:

CREATE (admin:Admin)

CREATE (person1:Person {person_id : 1})
CREATE (person2:Person {person_id : 2})
CREATE (person3:Person {person_id : 3})

CREATE (domain1:Domain {domain_id : 1})

CREATE (device1:Device {device_id : 1})

CREATE (person1)-[:ADMIN {from : 0, to : 1000}]->(admin)

CREATE (person1)-[:CONNECTED_DEVICE {from : 0, to : 1000}]->(device1)

CREATE (domain1)-[:MEMBER]->(person1)
CREATE (domain1)-[:MEMBER]->(person2)
CREATE (domain1)-[:MEMBER]->(person3)

Result:

[image: initial graph state]

The above graph has 3 person nodes, all members of a domain node. The person node with person_id = 1 is connected to the device with device_id = 1 and is also the current administrator. The from and to properties on the :ADMIN and :CONNECTED_DEVICE relationships are used to manage the history of the graph structure: from represents a start point in time and to an end point in time. For simplicity I'm using 0 as the initial time of the graph and 1000 as the end-of-time (EOT) constant. In a real-world graph, the current time in milliseconds can be used to represent time points, and Long.MAX_VALUE can be used as the EOT constant. A relationship with to = 1000 has no upper bound on its period yet, i.e. it is still current.
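
As a rough sketch of how this could look with real timestamps, the statement below opens a connection interval using Cypher's built-in timestamp() function (milliseconds since the epoch) and the literal 9223372036854775807 (Long.MAX_VALUE) as the end-of-time marker. The person and device ids are only illustrative; the rest of this answer keeps 0 and 1000 for readability.

```cypher
// Assumption: 9223372036854775807 (Long.MAX_VALUE) is the end-of-time constant.
MATCH (person:Person {person_id : 1})
MATCH (device:Device {device_id : 1})
// timestamp() returns the current time in milliseconds since the epoch
CREATE (person)-[:CONNECTED_DEVICE {from : timestamp(), to : 9223372036854775807}]->(device)
```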

Queries:

With this graph, to get the current administrator I can do:

MATCH (person:Person)-[:ADMIN {to:1000}]->(:Admin)
RETURN person

The result will be:

╒═══════════════╕
│"person"       │
╞═══════════════╡
│{"person_id":1}│
└───────────────┘

Given a device, to get the current connected user:

MATCH (:Device {device_id : 1})<-[:CONNECTED_DEVICE {to : 1000}]-(person:Person)
RETURN person

Resulting:

╒═══════════════╕
│"person"       │
╞═══════════════╡
│{"person_id":1}│
└───────────────┘

To query the current administrator and the current person connected to a device the End-Of-Time constant is used.

Query the device connect / disconnect events:

MATCH (device:Device {device_id : 1})<-[r:CONNECTED_DEVICE]-(person:Person)
RETURN person AS person, device AS device, r.from AS from, r.to AS to
ORDER BY r.from

Resulting:

╒═══════════════╤═══════════════╤══════╤════╕
│"person"       │"device"       │"from"│"to"│
╞═══════════════╪═══════════════╪══════╪════╡
│{"person_id":1}│{"device_id":1}│0     │1000│
└───────────────┴───────────────┴──────┴────┘

The above result shows that person_id = 1 has been connected to device_id = 1 from the beginning until now.

Changing the graph structure

Suppose the current time point is 30. Now person_id = 1 disconnects from device_id = 1, and person_id = 2 connects to it. To represent this structural change, I will run the query below:

// Get the currently connected person
MATCH (person1:Person)-[old:CONNECTED_DEVICE {to : 1000}]->(device:Device {device_id : 1})
// Get person_id = 2
MATCH (person2:Person {person_id : 2})
// Set 30 as the end time of the connection between person_id = 1 and device_id = 1
SET old.to = 30
// Set person_id = 2 as the currently connected user of device_id = 1
// (from time point 31 to now)
CREATE (person2)-[:CONNECTED_DEVICE {from : 31, to : 1000}]->(device)

The resultant graph will be:

[image: graph after structural change]

After this structural change, the connection history of device_id = 1 will be:

MATCH (device:Device {device_id : 1})<-[r:CONNECTED_DEVICE]-(person:Person)
RETURN person AS person, device AS device, r.from AS from, r.to AS to
ORDER BY r.from

╒═══════════════╤═══════════════╤══════╤════╕
│"person"       │"device"       │"from"│"to"│
╞═══════════════╪═══════════════╪══════╪════╡
│{"person_id":1}│{"device_id":1}│0     │30  │
├───────────────┼───────────────┼──────┼────┤
│{"person_id":2}│{"device_id":1}│31    │1000│
└───────────────┴───────────────┴──────┴────┘

The above result shows that person_id = 1 was connected to device_id = 1 from time point 0 to time point 30. person_id = 2 is currently connected to device_id = 1.
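
Because each relationship carries a closed interval, the same data can also answer who was connected at a given past time point. A minimal sketch, assuming a query parameter $t (not part of the original answer) that holds the time point of interest:

```cypher
// Who was connected to device_id = 1 at time point $t?
// Both interval ends are treated as inclusive, matching the 0..30 / 31..1000 convention above.
MATCH (device:Device {device_id : 1})<-[r:CONNECTED_DEVICE]-(person:Person)
WHERE r.from <= $t AND $t <= r.to
RETURN person
```

With $t = 15, only the relationship with from = 0 and to = 30 satisfies the condition, so this should return person_id = 1.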

Now the current person connected to device_id = 1 is person_id = 2:

MATCH (:Device {device_id : 1})<-[:CONNECTED_DEVICE {to : 1000}]-(person:Person)
RETURN person

╒═══════════════╕
│"person"       │
╞═══════════════╡
│{"person_id":2}│
└───────────────┘

The same approach can be applied to manage the admin history.
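
For illustration, an admin handover might look like the sketch below. It assumes, hypothetically, that person_id = 2 takes over as administrator at time point 30; it mirrors the device query above and is not taken from the original model.

```cypher
// Find whoever is currently the admin (open-ended :ADMIN relationship)
MATCH (oldAdmin:Person)-[old:ADMIN {to : 1000}]->(admin:Admin)
// Find the person taking over
MATCH (newAdmin:Person {person_id : 2})
// Close the old interval at time point 30
SET old.to = 30
// Open a new interval for the new admin, from 31 to end-of-time
CREATE (newAdmin)-[:ADMIN {from : 31, to : 1000}]->(admin)
```

The "current admin" query shown earlier keeps working unchanged, since it only looks for to = 1000.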

Obviously this approach has some downsides:

  • Need to manage a set of extra relationships
  • More expensive queries
  • More complex queries

But if you really need a versioning schema, I believe this approach is a good option or (at least) a good starting point.

Bruno Peres answered Sep 18 '22 15:09


Resolving a GUID

The first thing you need is to reliably resolve user ids so that they are consistent and globally unique. You said:

user id is specific only for production database and is not globally used for other sources

From this, I can infer 2 things:

  1. Users exist from multiple sources.
  2. For each source, users have a unique id.

So source + user.id will be a GUID. (You can hash the main connection URL or name each source externally.) I will assume you aren't merging users across multiple sources, because duplicating and merging data over any network creates an update-order paradox that should be avoided as much as possible. (If two sources list different new contact numbers, which one is correct?)
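
As a sketch of what that could mean for the import query in the question: merge on a source-qualified id and treat email as an ordinary property that may change between runs. The guid property, the 'production:' prefix and the $url / $import_query parameters are assumptions made for illustration only.

```cypher
// $url and $import_query are hypothetical parameters standing in for the JDBC settings
CALL apoc.load.jdbc($url, $import_query) YIELD row
// Assumption: 'production' names this source; source + user id forms the GUID
MERGE (p:Person {guid : 'production:' + toString(row.id)})
// email is now just data, so an email change updates the same node
SET p.user_id = row.id,
    p.email = row.email
```

If the existing unique constraint on email is kept, an email that already belongs to a person from another source would make the SET fail, so how persons from different sources are matched still has to be decided separately.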

Querying current data

The querying logic should be agnostic to any version tracking you may be doing. If your versioning causes problems with the logic, add a meta label like :Versioned with an indexed property isLatest and tack on a WHERE n.isLatest to filter the old "garbage" data out of your results.

So now that you don't need to worry about versions, queries 1 and 2 can be handled normally.

  1. For finding people who are admins, I would recommend just adding the label :Admin to the person and removing it when it no longer applies (as needed). This comes with the benefit of being indexed by the label "Admin". You can also just use an "isAdmin" property (which is probably how you are already storing it in the db, so it is more consistent). So the final query would just be MATCH (p:Person:Admin) or MATCH (p:Person{isAdmin:true}).

  2. With the old version information filtered out, the query for who has a device would simply be MATCH (p:Person:Versioned{isLatest:true})-[:HasDevice{isConnected:true}]->(d:Device:Versioned{isLatest:true})
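
A minimal sketch of the label approach from point 1, assuming the person is looked up by a hypothetical $email parameter:

```cypher
// Grant admin status: add the :Admin label to an existing person
MATCH (p:Person {email : $email})
SET p:Admin;

// Revoke admin status: remove the label again
MATCH (p:Person:Admin {email : $email})
REMOVE p:Admin;
```

These are two separate statements; the label only ever reflects the current state, so any history of who was admin when would have to be recorded elsewhere.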

This bit really just boils down to "What is your schema?"

Data History

This bit is where it really gets tricky. Depending on how you version the data, you can easily end up blowing up your data size and killing your DB performance. You REALLY need to ask yourself "Why am I versioning this?", "How often will this be updated/read?", "Who will use it and what will they do with it?". If at any point you answer "I don't know/care", you either shouldn't do this, or you should back up your data in a system that handles versioning for you, such as SQLAlchemy-Continuum (a versioning extension for SQLAlchemy). (Related answer)

If you must do this in Neo4j, then I would recommend using a delta chain. For example, if you changed {a:1, b:2} to {a:1, b:null, c:3}, you would have (:Thing{a:1, b:null, c:3})-[:_DELTA{timestamp:<value>}]->(:_ThingDelta{b: 2, c:null}). That way, to get a past value you just chain-apply the properties of the delta chain into a map, along the lines of MATCH (a:Thing) OPTIONAL MATCH p = (a)-[ds:_DELTA*]->(:_ThingDelta) WHERE all(d IN ds WHERE d.timestamp >= <value>) RETURN reduce(v = {_id:ID(a)}, n IN nodes(p) | apoc.map.merge(v, PROPERTIES(n))) AS OldVersion (using APOC's apoc.map.merge to combine the maps, since += is only valid in a SET clause). This can get very tedious though and eat up your DB space, so I would highly recommend using an existing database versioning tool if you possibly can.
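
Spelled out a little further, reading back an old version from such a chain might look like the sketch below. It assumes the APOC library is installed, that $since is a hypothetical parameter holding the cut-off timestamp, and that deltas hang off the :Thing node newest-first. Note that Neo4j does not store properties whose value is null, so a delta like c:null would need some other marker (for instance a list of removed keys) to record that a key did not exist; the sketch ignores that case.

```cypher
MATCH (a:Thing)
// Follow the chain of deltas recorded at or after $since
OPTIONAL MATCH p = (a)-[ds:_DELTA*]->(:_ThingDelta)
WHERE all(d IN ds WHERE d.timestamp >= $since)
// A variable-length match yields one path per chain length; keep the longest
WITH a, p ORDER BY length(p) DESC
WITH a, head(collect(p)) AS chain
// Fold node properties along the chain into one map;
// values from nodes further down the chain (older) override earlier ones
RETURN reduce(v = {_id : id(a)},
              n IN coalesce(nodes(chain), [a]) |
              apoc.map.merge(v, properties(n))) AS oldVersion
```

When a node has no qualifying deltas, chain is null and the coalesce falls back to the node's current properties.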

Tezra answered Sep 18 '22 15:09