
How do I process a graph that is constantly updating, with low latency?

I am working on a project that involves many clients connecting to a server (or servers, if need be) that holds a bunch of graph data (node attributes and edges). The clients will have the option to introduce a new node or edge at any time and then request some information about the graph as a whole (shortest distance between two nodes, graph coloring, etc.).

It is obviously quite easy to develop the naive algorithm for this, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and the possibility of a very large number of nodes (500k+) and possibly a very large number of edges as well.
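For concreteness, the naive version is roughly something like this (a minimal sketch, assuming an in-memory adjacency list and BFS for unweighted shortest paths; the class and method names are just placeholders):

```python
from collections import defaultdict, deque

class NaiveGraphServer:
    """One shared in-memory graph; every query walks the current structure."""

    def __init__(self):
        self.adj = defaultdict(set)   # node -> set of neighbouring nodes
        self.attrs = {}               # node -> attribute dict

    def add_node(self, node, **attrs):
        self.adj.setdefault(node, set())
        self.attrs[node] = attrs

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)            # undirected, for simplicity

    def shortest_distance(self, src, dst):
        """Plain BFS: O(V + E) per query, recomputed from scratch every time."""
        if src == dst:
            return 0
        seen = {src}
        frontier = deque([(src, 0)])
        while frontier:
            node, dist = frontier.popleft()
            for nbr in self.adj[node]:
                if nbr == dst:
                    return dist + 1
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, dist + 1))
        return None                   # unreachable
```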

The challenges I can foresee:

  • with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
  • with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
  • a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times

How do I start facing these challenges? I looked at Hadoop and Spark, but they seem to be high-latency solutions (built around batch processing) or solutions that address problems where the graph is not constantly changing.

I had the idea of processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing only that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
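The bookkeeping I'm imagining is roughly this (just a sketch; the partitioning function and the per-partition analysis are placeholders I would still have to supply):

```python
class PartitionedGraphCache:
    """Cache one result per partition and recompute only the dirty partitions."""

    def __init__(self, partition_of, analyze_partition):
        self.partition_of = partition_of            # node -> partition id
        self.analyze_partition = analyze_partition  # partition id -> result
        self.results = {}                           # partition id -> cached result
        self.dirty = set()                          # partitions touched since last query

    def on_edge_added(self, u, v):
        # An update only invalidates the partitions it touches.
        self.dirty.add(self.partition_of(u))
        self.dirty.add(self.partition_of(v))

    def query(self):
        # Recompute only the stale partitions, then hand back the combined index.
        for pid in self.dirty:
            self.results[pid] = self.analyze_partition(pid)
        self.dirty.clear()
        return self.results
```

I suspect the catch is that global questions like shortest paths cross partition boundaries, so combining the per-partition results is the expensive part.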

Thanks!

asked Nov 24 '15 by user947659

2 Answers

How do I start facing these challenges?

I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.

In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean

  1. "everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
  2. Or "everything transacted more than X seconds ago", which leads to partial serialization, which multiple database states in the present that are progressively serialized into the past?

If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. You do have immediate information about when to roll back a transaction because of inconsistency.

If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
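As a toy illustration of option 2 (only a sketch of the idea, not a recommended implementation): writers mutate the live graph, while readers query the most recent snapshot that has aged past X seconds, so reads never wait on in-flight writes.

```python
import copy
import threading
import time

class SnapshotGraph:
    """Writers mutate the live graph; readers see a snapshot at most `staleness` old."""

    def __init__(self, staleness=5.0):
        self.staleness = staleness
        self.live = {"adj": {}, "version": 0}
        self.snapshot = copy.deepcopy(self.live)
        self.snapshot_time = time.time()
        self.lock = threading.Lock()

    def add_edge(self, u, v):
        with self.lock:
            self.live["adj"].setdefault(u, set()).add(v)
            self.live["adj"].setdefault(v, set()).add(u)
            self.live["version"] += 1

    def read_view(self):
        # Publish a fresh snapshot only once the current one has aged out.
        # (deepcopy keeps the sketch short; a real version would use
        #  structural sharing / MVCC rather than copying 500k nodes.)
        with self.lock:
            if time.time() - self.snapshot_time >= self.staleness:
                self.snapshot = copy.deepcopy(self.live)
                self.snapshot_time = time.time()
            return self.snapshot      # queries run against this, not the live graph
```

Under option 1, by contrast, every query has to serialize against every update to the live structure for the duration of the computation, which is where the hot spots come from.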

Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.

answered Nov 03 '22 by eh9


I don't know much about graphs, but I do understand a bit of networking.

One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.

All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.

The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.

Clients only need to know if there are new records or if old records have changed.
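A sketch of that division of labour (the message format and the in-process queues here are just illustrative; in practice the notifications would go over websockets or similar):

```python
import json
import queue
import threading

class GraphChangeFeed:
    """Server side: store raw edits and fan change notifications out to clients."""

    def __init__(self):
        self.log = []           # append-only list of raw edits (the server's only job)
        self.subscribers = []   # one queue per connected client
        self.lock = threading.Lock()

    def subscribe(self):
        q = queue.Queue()
        with self.lock:
            for edit in self.log:     # a new client gets the full history first
                q.put(edit)
            self.subscribers.append(q)
        return q

    def apply_edit(self, edit):
        """`edit` is raw data, e.g. {'op': 'add_edge', 'u': 1, 'v': 2}; no graph work here."""
        msg = json.dumps(edit)
        with self.lock:
            self.log.append(msg)
            targets = list(self.subscribers)
        for q in targets:
            q.put(msg)                # notify; each client recomputes on its own copy


# Client side: keep a local copy of the raw data and do the heavy graph work there.
def client_loop(feed):
    adj = {}
    updates = feed.subscribe()
    while True:
        edit = json.loads(updates.get())
        if edit["op"] == "add_edge":
            adj.setdefault(edit["u"], set()).add(edit["v"])
            adj.setdefault(edit["v"], set()).add(edit["u"])
        # ...run shortest-path / colouring calculations locally on `adj` as needed...
```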

If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.

answered Nov 04 '22 by flcoder