I have written a C++ program that simulates a certain process I'm studying. It outputs a discrete "state" at each timestep of the simulation. For example:
a
b
c
b
c
b
would be the output of a simulation run with a as the initial condition (set by me or randomly generated), after which the system keeps oscillating between the states b and c.
I would like to combine many of these runs into a Markov chain, so that they turn into a graph with the following vertices and edges (preferably at runtime, because saving the output first takes a lot of disk space). The numbers between parentheses indicate the number of times a certain vertex or edge was encountered, so these should also be stored.
Vertices: a(1), b(3) and c(2).
Edges: a->b(1), b->c(2), c->b(2).
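To make the bookkeeping concrete, here is a toy sketch of the counting I have in mind, using single-character states instead of my real 112-bit ones and plain std::map (far too slow at my scale, but it shows the idea):

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>

int main() {
    // Toy version of the run shown above: a, b, c, b, c, b
    const std::string run = "abcbcb";

    std::map<char, std::uint64_t> vertex_count;                 // a(1), b(3), c(2)
    std::map<std::pair<char, char>, std::uint64_t> edge_count;  // a->b(1), b->c(2), c->b(2)

    for (std::size_t i = 0; i < run.size(); ++i) {
        ++vertex_count[run[i]];                      // count every visit to a state
        if (i + 1 < run.size())
            ++edge_count[{run[i], run[i + 1]}];      // count every observed transition
    }

    for (const auto& v : vertex_count)
        std::cout << v.first << "(" << v.second << ")\n";
    for (const auto& e : edge_count)
        std::cout << e.first.first << "->" << e.first.second << "(" << e.second << ")\n";
}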
The real states contain 112 bits of information and I'm generating billions of these transitions. The problem is that I haven't found a graph library or program that can build the Markov chain quickly and efficiently. I have been toying around with a few approaches. I just completed a graph built on Google's sparse hash map, but it turns out to get really slow halfway into the runs. After about a day (memory usage goes above 20 GB, not a problem in itself, because there is plenty more available), it slows down and takes about three weeks to complete.
I have access to computers with 12 or 16 cores and 256 or 512 GB of memory, and my feeling is they should be up for the job.
Since I'm not a trained programmer and I code quite slowly, I'm looking for some information before I spend a lot of time working on another imperfect solution.
I hope I was able to make my issue clear. Thanks in advance for any wisdom or answers.
EDIT:
Based on the questions and answers in the comments, I guess my question should have been: what is a suitable, fast matrix library for C++?
Did you look at boost::numeric::ublas? It has sparse matrix types that give you matrix-like access but, instead of building an NxN array in memory, keep a list of edges per node. So if N is the number of nodes, instead of an NxN array in memory you keep roughly Nx30 entries, where 30 is the average number of edges per node.
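As a rough sketch of what that looks like in code (this assumes you have already mapped your 112-bit states to dense uint32 indices 0..N-1, and uses the single-byte counts discussed below, which silently wrap past 255):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstdint>
#include <iostream>

int main() {
    namespace ublas = boost::numeric::ublas;

    const std::size_t N = 1000;                        // number of distinct states (600M in your case)
    ublas::mapped_matrix<std::uint8_t> counts(N, N);   // only non-zero entries are stored

    // Record a transition from state i to state j by bumping its count;
    // element access creates the entry the first time it is touched.
    std::size_t i = 3, j = 7;
    counts(i, j) += 1;
    counts(i, j) += 1;

    std::cout << "edge " << i << "->" << j << " seen "
              << static_cast<unsigned>(counts(i, j)) << " times\n";
}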
However, even assuming you can use a single byte to count the recurrence of an edge, you still have 600M nodes, each with a list of 30 edges. A list entry is the edge name (a uint32, 4 bytes) plus a count of at least 1 byte, so a minimum of 150 bytes per list, which comes out to at least 90 GB in memory, and likely more because of the per-element overhead of a list.
If you can keep all of this in memory without the OS swapping data to disk, then there is no reason why it should not work fast. Of course it is possible that an ordered map will outperform a hash_map; it depends on the implementation and the hash function used.
Naively: std::map<uint32_t, std::map<uint32_t, uint8_t>>. If the tree is balanced, the depth of the big tree is about 30 (log2 of 600M) and the small one is tiny, so access shouldn't take ages. It is possible that a hash_map will work better for the outer map, though that is not certain: hash_map<uint32_t, std::map<uint32_t, uint8_t>>.
(Google's sparse hash map is tuned for memory rather than speed, and the outer map will be very big, which probably makes it a bad fit.)
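A minimal sketch of the nested-map variant, using std::unordered_map as the modern stand-in for hash_map (again assuming the 112-bit states have already been mapped to uint32_t ids; a uint8_t count silently wraps past 255):

#include <cstdint>
#include <map>
#include <unordered_map>

// Outer key: source node id. Inner key: destination node id.
// Inner value: number of times that edge was seen (one byte, as discussed above).
using EdgeCounts = std::unordered_map<std::uint32_t, std::map<std::uint32_t, std::uint8_t>>;

// Record one observed transition from 'from' to 'to'.
// operator[] inserts a zero-initialised entry on first access at both levels.
inline void record_transition(EdgeCounts& edges, std::uint32_t from, std::uint32_t to) {
    ++edges[from][to];
}

Swap the outer container for std::map if the ordered variant turns out to be faster for your access pattern.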
Finally, you should consider holding this information on disk instead of in memory. In fact, you could use an external data service such as a database, with one table for the nodes (NodeId, NumOfHits) and one table for the edges (FromNodeId, ToNodeId, NumOfHits); this representation takes up a lot more space, though.
I'd try something like Cassandra, which can manage the disk vs. memory cache for you and can easily scale across multiple computers, and you don't need the overhead of complex transaction models, etc.