Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Managing a stateful computation system in Haskell

So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.

I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. Ie. I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.

I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)

I had been approaching it using closures to hold state. But I'm wondering if it would make more sense to use a monad. I have only read through 3 or 4 analogies for monads so don't understand them well yet, so feel free to educate me.

If each processor modifies its state in the monad in such a way that its history is kept and it is tied to an id for each processing step. And then somehow the monad is able to switch its state to a past step id and run the system with the monad in that state. And the monad would have some mechanism for (de)serializing itself for storage.

(and given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...)

Is there an existing library/mechanism/approach/concept that has already been done to accomplish or assist in accomplishing what I'm trying to do?

like image 535
mentics Avatar asked Jul 26 '11 19:07

mentics


1 Answers

So, I have a system of stateful processors that are chained together. For example, a processor might output the average of its last 10 inputs. It requires state to calculate this average.

First of all, it sounds like what you have are not just "stateful processors" but something like finite-state machines and/or transducers. This is probably a good place to start for research.

I would like to submit values to the system, and get the outputs. I also would like to jump back and restore the state at any time in the past. Ie. I run 1000 values through the system. Now I want to "move" the system back to exactly as it was after I had sent the 500th value through. Then I want to "replay" the system from that point again.

The simplest approach here, of course, is to simply keep a log of all prior states. But since it sounds like you have a great deal of data, the storage needed could easily become prohibitive. I would recommend thinking about how you might construct your processors in a way that could avoid this, e.g.:

  • If a processor's state can be reconstructed easily from the states of its neighbors a few steps prior, you can avoid logging it directly
  • If a processor is easily reversible in some situations, you don't need to log those immediately; rewinding can be calculated directly, and logging can be done as periodic snapshots
  • If you can nail a processor down to a very small number of states, make sure to do so.
  • If a processor behaves in very predictable ways on certain kinds of input, you can record that as such--e.g., if it idles on numeric input below some cutoff, rather than logging each value just log "idled for N steps".

I also need to be able to persist the historical state to disk so I can restore the whole thing again some time in the future (and still have the move back and replay functions work). And of course, I need to do this with gigabytes of data, and have it be extremely fast :)

Explicit state is your friend. Functions are a convenient way to represent active state machines, but they can't be serialized in any simple way. You want a clean separation of a (basically static) network of processors vs. a series of internal states used by each processor to calculate the next step.

Is there an existing library/mechanism/approach/concept that has already been done to accomplish what I'm trying to do? Does the monad approach make sense? Are there other better/special approaches that would help it do this efficiently especially given the enormous amount of data I have to manage?

If most of your processors resemble finite state transducers, and you need to have processors that take inputs of various types and produce different types of outputs, what you probably want is actually something with a structure based on Arrows, which gives you an abstraction for things that compose "like functions" in some sense, e.g., connecting the input of one processor to the output of another.

Furthermore, as long as you avoid the ArrowApply class and make sure that your state machine model only returns an output value and a new state, you'll be guaranteed to avoid implicit state because (unlike functions) Arrows aren't automatically higher-order.

Given the size of the data... it really shouldn't even all be in memory, which would mean the monad would need to be mapped to disk, cached, etc...

Given a static representation of your processor network, it shouldn't be too difficult to also provide an incremental input/output system that would read the data, serialize/deserialize the state, and write any output.


As a quick, rough starting point, here's an example of probably the simplest version of what I've outlined above, ignoring the logging issue for the moment:

data Transducer s a b = Transducer { curState :: s
                                   , runStep  :: s -> a -> (s, b)
                                   }

runTransducer :: Transducer s a b -> [a] -> [b]
runTransducer t [] = (t, [])
runTransducer t (x:xs) = let (s, y) = runStep t (curState t) x
                             (t', ys) = runTransducer (t { curState = s }) xs
                         in (t', y:ys)

It's a simple, generic processor with explicit internal state of type s, input of type a, and output of type b. The runTransducer function shoves a list of inputs through, updating the state value manually, and collects a list of outputs.


P.S. -- since you were asking about monads, you might want to know if the example I gave is one. In fact, it's a combination of multiple common monads, though which ones depends on how you look at it. However, I've deliberately avoided treating it as a monad! The thing is, monads capture only abstractions that are in some sense very powerful, but that same power also makes them more resistant in some ways to optimization and static analysis. The main thing that needs to be ruled out is processors that take other processors as input and run them, which (as you can imagine) can create convoluted logic that's nearly impossible to analyze.

So, while the processors probably could be monads, and in some logical sense intrinsically are, it may be more useful to pretend that they aren't; imposing an artificial limitation in order to make static analysis simpler.

like image 141
C. A. McCann Avatar answered Nov 04 '22 12:11

C. A. McCann