I'm writing an application that interacts with a large (10-1000 GB) memory-mapped binary file, holding essentially a bunch of objects that refer to each other. I've come up with a mechanism to read/write this data which is effective, but ugly and verbose (imo).
Q: Is there a more elegant way to achieve what I've done?
I have a typeclass for structured data, with one method that reads the structure into a Haskell datatype (DataOp is a ReaderT around IO).
class DBStruct a where
    structRead :: Addr a -> DataOp a
To make this more readable, I have another typeclass that defines what structure members go where:
class DBStruct structTy => StructMem structTy valTy name | structTy name -> valTy where
    offset :: structTy -> valTy -> name -> Int64
I have a few helper functions that use the offset method for reading/writing structure elements, for reading structures from stored references, and for lazily deferring structure reads (to allow lazy reading of the entire file).
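The helpers themselves aren't shown here, but a read helper built on offset might look roughly like the sketch below. The DataOp environment, the Addr representation, and this elemRead body are assumptions for illustration, not the real code (it builds on the StructMem class above):

{-# LANGUAGE ScopedTypeVariables, MultiParamTypeClasses, FunctionalDependencies #-}
import Control.Monad.Reader (ReaderT, ask)
import Control.Monad.IO.Class (liftIO)
import Foreign.Ptr (Ptr)
import Foreign.Storable (Storable, peekByteOff)
import Data.Int (Int64)

newtype Addr a = Addr Int64           -- assumed: byte offset of a structure within the mapping
type DataOp = ReaderT (Ptr ()) IO     -- assumed: ReaderT carrying the mmapped base pointer

-- Read one structure member, using the StructMem offset to find its position.
elemRead :: forall st v n. (StructMem st v n, Storable v) => Addr st -> n -> DataOp v
elemRead (Addr base) name = do
    p <- ask                          -- base pointer of the mapping
    liftIO $ peekByteOff p (fromIntegral (base + offset (undefined :: st) (undefined :: v) name))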
The problem with this is that it involves a lot of repetition to use. For one structure, I first have to define the Haskell type:
data RowBlock = RowBlock { rbNext :: Maybe RowBlock
                         , rbPrev :: Maybe RowBlock
                         , rbRows :: [RowTy]
                         }
Then the name types:
data Next = Next
data Prev = Prev
data Count = Count
newtype Row = Row Int64
Then the instances for each structure member:
instance StructMem RowBlock (Maybe (Addr RowBlock)) Next where offset _ _ _ = 0
instance StructMem RowBlock (Maybe (Addr RowBlock)) Prev where offset _ _ _ = 8
instance StructMem RowBlock Int64 Count where offset _ _ _ = 16
instance StructMem RowBlock RowTy Row where offset _ _ (Row n) = 24 + n * 8
Then the structure read method:
instance DBStruct RowBlock where
    structRead a = do
        n  <- elemMaybePtr a Next
        p  <- elemMaybePtr a Prev
        c  <- elemRead a Count
        rs <- mapM (elemRead a . Row) [0 .. c-1]
        return $ RowBlock n p rs
So all I've really accomplished is to re-implement C structs in a much more verbose (and slow) way. I would be happier if this were more concise while preserving type safety. Surely this is a commonly encountered problem.
A few possible alternatives I can think of are:

- Data.Binary, writing the ByteStrings to disk the normal way
- deriving Generic to create generic read and write functions (a rough sketch of this follows below)

EDIT: SSCCE as requested
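As an illustration of the deriving Generic alternative above (with assumed, simplified field types; the generically derived encoding wouldn't by itself reproduce the fixed on-disk layout shown earlier):

{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary)
import GHC.Generics (Generic)
import Data.Int (Int64)

-- Sketch only: fields are simplified (file offsets instead of lazy references).
data RowBlockRec = RowBlockRec
    { rNext :: Maybe Int64
    , rPrev :: Maybe Int64
    , rRows :: [Int64]
    } deriving (Generic, Show)

instance Binary RowBlockRec   -- put/get come from the Generic default methods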
You might try using Data.Binary with your Ptrs.
For writing:
Use Data.Binary to build a ByteString. A ByteString is backed by a ForeignPtr Word8 plus an offset and a length describing where its data is stored. The Data.ByteString.Internal module provides toForeignPtr, which unpacks that triple for you. Foreign.ForeignPtr provides withForeignPtr, which takes a function that performs an IO action via the pointer. In there you can memcpy (a binding for this is also provided in Data.ByteString.Internal) the ByteString's storage to the mmapped Ptr you got from mmap.
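A minimal sketch of that write path (writeAt is a made-up helper name; dest is assumed to be a pointer into the mapping obtained from mmap):

import qualified Data.Binary as Binary
import qualified Data.ByteString.Internal as BSI
import qualified Data.ByteString.Lazy as BL
import Foreign.ForeignPtr (withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Data.Word (Word8)

-- Serialise a value with Data.Binary, then memcpy the bytes into the mapping.
writeAt :: Binary.Binary a => Ptr Word8 -> a -> IO ()
writeAt dest x = do
    let bs = BL.toStrict (Binary.encode x)     -- strict ByteString holding the encoding
        (fp, off, len) = BSI.toForeignPtr bs   -- underlying storage: pointer, offset, length
    withForeignPtr fp $ \src ->
        BSI.memcpy dest (src `plusPtr` off) len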
For reading:
You can use Data.ByteString.Internal's fromForeignPtr to turn a pointer (wrapped as a ForeignPtr) into a ByteString. This is basically what the mmap libraries do, but you could do it a record at a time instead of for the entire region. Once you have a ByteString view on the memory, you can unpack it with Data.Binary.
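Roughly, and again with a made-up helper name (readAt), assuming you know the record's length at that address:

import qualified Data.Binary as Binary
import qualified Data.ByteString.Internal as BSI
import qualified Data.ByteString.Lazy as BL
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import Data.Word (Word8)

-- Wrap a pointer into the mapping as a ByteString view (no copy) and decode it.
readAt :: Binary.Binary a => Ptr Word8 -> Int -> IO a
readAt src len = do
    fp <- newForeignPtr_ src                   -- no finalizer: the mapping owns the memory
    let bs = BSI.fromForeignPtr fp 0 len
    return (Binary.decode (BL.fromStrict bs))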
Another option is to take advantage of the fact that ByteString has an alternative implementation in Data.Vector.Storable.ByteString, which would let you use the Storable interface you're using now to read/write them to the mmapped Ptrs. The interface and basic type are isomorphic to the Data.ByteString ones, but it's got the Storable instances.
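For reference, a no-copy Storable-vector view of part of the mapping (which is roughly the representation that alternative ByteString uses under the hood) might look like this; viewRegion is a hypothetical name:

import qualified Data.Vector.Storable as VS
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import Data.Word (Word8)

-- View len bytes of the mapping as a Storable vector without copying.
viewRegion :: Ptr Word8 -> Int -> IO (VS.Vector Word8)
viewRegion p len = do
    fp <- newForeignPtr_ p          -- no finalizer: the mapping owns the memory
    return (VS.unsafeFromForeignPtr0 fp len)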