
What is the best/fastest way to construct a very large markov chain from simulation data?

I have written a C++ program that simulates a certain process I'm studying. It outputs a discrete "state" at each timestep of the simulation. For example:

a
b
c
b
c
b

would be the output of a simulation run with a as the initial condition (set by me or randomly generated), where b and c are the states the system keeps oscillating between.

I would like to combine many of these runs into a Markov chain, so that it becomes a graph with the following vertices and edges (preferably at runtime, because saving the output first takes a lot of disk space). The numbers in parentheses indicate how many times a certain vertex or edge was encountered, so these should also be stored.

Vertices: a(1), b(3) and c(2).

Edges: a->b(1), b->c(2), c->b(2).
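For illustration, the kind of runtime counting I have in mind looks roughly like this (a hypothetical sketch, not my actual code; it assumes for simplicity that a state fits in a std::string, whereas my real 112-bit states would need a proper key type):

    // Hypothetical sketch: read one state per line and count how often each
    // vertex and each edge (transition between consecutive states) occurs.
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, std::uint64_t> vertex_count;
        std::unordered_map<std::string, std::uint64_t> edge_count;

        std::string prev, state;
        while (std::getline(std::cin, state)) {
            ++vertex_count[state];
            if (!prev.empty())
                ++edge_count[prev + "->" + state];  // edge key packed into a string
            prev = state;
        }

        for (const auto& v : vertex_count)
            std::cout << v.first << "(" << v.second << ")\n";
        for (const auto& e : edge_count)
            std::cout << e.first << "(" << e.second << ")\n";
    }

Feeding it the example run above would give exactly the counts listed.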

The real states contain 112 bits of information and I'm generating billions of these transitions. The problem is that I haven't found a graph library or program that can build the Markov chain quickly and efficiently. I have been toying around with:

  • Google sparse hash to construct my own graph class in C++.
  • Neo4J (I was just getting started with this one)
  • Lemon library

I just completed the "Google sparse hash graph", but it turns out to get really slow halfway into the runs. After about a day (memory usage goes above 20 GB, which is not a problem in itself, because there is much more available), it slows down and would take about three weeks to complete.

I have access to computers with 12 or 16 cores and 256 or 512 GB of memory, and my feeling is they should be up for the job.

Since I'm not a trained programmer and I code quite slowly, I'm looking for some information before I spend a lot of time working on another imperfect solution.

  • What would be the best program/library that can quickly accept large numbers of vertices and edges to construct the Markov chain?
  • Is the slowness a result of using the wrong tools or imperfect coding (which I suspect) or am I simply trying to do something that will always take a lot of time?

I hope I was able to make my issue clear. Thanks in advance for any wisdom or answers.

EDIT:

Based on the questions and answers in the comments I guess my question should have been: what is a suitable fast matrix library for C++?


1 Answer

Did you look at boost::numeric::ublas? It provides sparse matrix types that give you matrix-like access, but instead of building an NxN array in memory they keep a list of edges per node.

So if N is the number of nodes, instead of an NxN array in memory you keep about Nx30 entries, where 30 is the average number of edges per node.
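A minimal sketch of what that could look like, assuming the 112-bit states have already been mapped to dense integer IDs 0..N-1 (that mapping is not shown, and N is just an example value here):

    // Sketch: count transitions in a ublas sparse matrix, where entry (i, j)
    // holds the number of times the edge i -> j was observed.
    #include <boost/numeric/ublas/matrix_sparse.hpp>
    #include <cstdint>

    namespace ublas = boost::numeric::ublas;

    int main() {
        const std::size_t N = 1000;                         // number of distinct states (example)
        ublas::mapped_matrix<std::uint32_t> counts(N, N);   // stores only non-zero entries

        // Record some transitions; accessing a missing element inserts it.
        counts(0, 1) += 1;   // a -> b
        counts(1, 2) += 1;   // b -> c
        counts(2, 1) += 1;   // c -> b
        counts(1, 2) += 1;   // b -> c again
    }

Only the non-zero entries are stored, so memory use is proportional to the number of distinct edges rather than to NxN.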

However, even assuming you can use a single byte to count the recurrence of an edge, you still have 600M nodes, each with a list of 30 edges.

Each list entry is the edge name (a uint32, 4 bytes) plus the count (at least 1 byte), so each list takes at least 30 x 5 = 150 bytes. That comes out to at least 600M x 150 bytes = 90 GB in memory, and likely more, because there is per-element overhead in a list.

If you can keep all of this in memory without the OS swapping data to disk, there is no reason why it should not work fast. Of course, it is possible that an ordered map will outperform a hash_map; it depends on the implementation and the hash function used.

Naively, std::map<uint32, std::map<uint32, uint8>>: if the tree is balanced, the depth of the big tree is about 30 (log2 of 600M) and the small one is tiny, so access shouldn't take ages. It is possible that a hash_map would work better for the outer (column) map, though that is not certain: hash_map<uint32, std::map<uint32, uint8>>. (Google sparse hash map is tuned for memory, not speed, and the column map will be very big, which probably makes it a bad fit.)
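A sketch of that naive nested-map layout (the uint32 node IDs are assumed to come from your own mapping of the 112-bit states to indices; a uint32 counter is used here instead of the single byte from above, to avoid overflow):

    #include <cstdint>
    #include <map>

    // Outer map: source node; inner map: destination node -> number of hits.
    using EdgeCounts = std::map<std::uint32_t, std::map<std::uint32_t, std::uint32_t>>;

    void add_transition(EdgeCounts& edges, std::uint32_t from, std::uint32_t to) {
        ++edges[from][to];   // missing entries are created with a zero count first
    }

    int main() {
        EdgeCounts edges;
        add_transition(edges, 0, 1);  // a -> b
        add_transition(edges, 1, 2);  // b -> c
        add_transition(edges, 2, 1);  // c -> b
    }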

Finally, you should consider holding this information on disk instead of in memory. In fact, you could use an external data service such as a database, with a table for the nodes (NodeId, NumOfHits) and a table for the edges (NodeId, NodeId, NumOfHits). (This representation takes up a lot more space.)
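One hypothetical way to lay that out, using SQLite purely as an example of such a store (the answer does not prescribe a particular database, and the upsert shown here needs SQLite 3.24 or newer):

    // Sketch: persist edge counts in an on-disk table instead of in memory.
    #include <sqlite3.h>
    #include <cstdint>

    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("markov.db", &db) != SQLITE_OK) return 1;

        // One table for the vertices, one for the edges, each with a hit counter.
        const char* schema =
            "CREATE TABLE IF NOT EXISTS nodes (NodeId INTEGER PRIMARY KEY, NumOfHits INTEGER);"
            "CREATE TABLE IF NOT EXISTS edges (SrcId INTEGER, DstId INTEGER, NumOfHits INTEGER,"
            " PRIMARY KEY (SrcId, DstId));";
        sqlite3_exec(db, schema, nullptr, nullptr, nullptr);

        // Record one observed transition src -> dst, incrementing its counter.
        auto record_edge = [&](std::int64_t src, std::int64_t dst) {
            sqlite3_stmt* stmt = nullptr;
            sqlite3_prepare_v2(db,
                "INSERT INTO edges VALUES (?, ?, 1) "
                "ON CONFLICT(SrcId, DstId) DO UPDATE SET NumOfHits = NumOfHits + 1;",
                -1, &stmt, nullptr);
            sqlite3_bind_int64(stmt, 1, src);
            sqlite3_bind_int64(stmt, 2, dst);
            sqlite3_step(stmt);
            sqlite3_finalize(stmt);
        };

        record_edge(1, 2);  // e.g. b -> c
        sqlite3_close(db);
        return 0;
    }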

I'd try something like Cassandra, which can manage the disk vs. memory caching for you and can easily scale across multiple computers, and you don't need the overhead of complex transaction models, etc.
