Distributing Haskell on a cluster

I have a piece of code that processes files:

processFiles :: [FilePath] -> (FilePath -> IO ()) -> IO ()

This function spawns an async process that executes an IO action. This IO action must be submitted to a cluster through a job scheduling system (e.g. Slurm).

Because I must use the job scheduling system, it's not possible to use Cloud Haskell to distribute the closure. Instead, the program writes a new Main.hs containing the desired computation, copies it to the cluster node together with all the modules that Main depends on, and then executes it remotely with "runhaskell Main.hs [opts]". The async process should then periodically ask the job scheduling system (sleeping with threadDelay between checks) whether the job is done.
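The polling part would look roughly like the sketch below (waitForJob, the use of sacct, and the one-minute interval are just an illustration of what I mean, not the real code):

import Control.Concurrent (threadDelay)
import System.Process (readProcess)

-- Ask Slurm for the job's state until it reaches a terminal one.
waitForJob :: String -> IO ()
waitForJob jobId = do
  out <- readProcess "sacct" ["-n", "-o", "State", "-j", jobId] ""
  if any (`elem` ["COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"]) (words out)
    then return ()
    else do
      threadDelay (60 * 1000000)  -- wait a minute before asking again
      waitForJob jobId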

Is there a way to avoid creating a new Main? Can I serialize the IO action and execute it somehow on the node?

asked Mar 13 '15 by felipez


1 Answer

Yep. There is a magical library called packman. It allows you to turn any Haskell thing into data (as long as it does not have IORefs or related things in it). Here are the things you would need:

trySerialize :: a -> IO (Serialized a)
deserialize :: Serialized a -> IO a
instance Typeable a => Binary (Serialized a)

Yep, those are the exact types. You can package up your IO action using trySerialize, use the Binary instance to transfer it to wherever it needs to go, and then deserialize to get the IO action back, ready for use.
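As a rough sketch of both ends (the GHC.Packing module name and the file-based handoff are my assumptions here; adapt them to however you copy files to the node):

import GHC.Packing (trySerialize, deserialize)
import Data.Binary (encodeFile, decodeFile)

-- On the submitting machine: pack the (unevaluated) IO action and write
-- it to a file that you copy to the cluster node.
sendAction :: FilePath -> IO () -> IO ()
sendAction path act = do
  packed <- trySerialize act    -- packed :: Serialized (IO ())
  encodeFile path packed        -- uses the Binary (Serialized a) instance

-- On the node: read the file back, unpack it, and run the action.
runAction :: FilePath -> IO ()
runAction path = do
  packed <- decodeFile path
  act    <- deserialize packed
  act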

The caveats for packman are:

  • It stores things as thunks. This is probably what you want, so that the node can do the evaluating.
    • That said, if your thunk is huge, the serialized Binary will probably be huge as well. Evaluating the thunk first can fix this (see the sketch after this list).
  • Like I said, mutable references are a no-no. One thing to watch out for is them hiding inside thunks without you knowing it.
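For the huge-thunk point above, one option (my suggestion, not something packman requires) is to force the value before packing it, e.g. with deepseq:

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import GHC.Packing (trySerialize, Serialized)

-- Evaluate to normal form first, so the serialized payload is the result
-- rather than a (possibly huge) unevaluated computation.
serializeForced :: NFData a => a -> IO (Serialized a)
serializeForced x = evaluate (force x) >>= trySerialize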

Other than that, this seems like what you want!

answered Sep 26 '22 by PyRulez