
Storing large structured binary data with Haskell

Tags:

haskell

I'm writing an application that interacts with a large (10-1000 GB) memory-mapped binary file, holding essentially a bunch of objects that refer to each other. I've come up with a mechanism for reading and writing this data that is effective, but ugly and verbose (imo).

Q: Is there a more elegant way to achieve what I've done?

I have a typeclass for structured data, with one method that reads the structure into a Haskell datatype (DataOp is a ReaderT around IO).

class DBStruct a where
    structRead :: Addr a -> DataOp a
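
For concreteness, the supporting types look roughly like this (simplified; Env and envBase are stand-in names):

import Control.Monad.Reader (ReaderT)
import Data.Int (Int64)
import Data.Word (Word8)
import Foreign.Ptr (Ptr)

newtype Addr a = Addr Int64              -- byte address of a structure within the file
data Env = Env { envBase :: Ptr Word8 }  -- base pointer of the mmapped region
type DataOp = ReaderT Env IO             -- "a ReaderT around IO", as stated above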

To make this more readable, I have another typeclass that defines what structure members go where:

class DBStruct st => StructMem structTy valTy name | structTy name -> valTy where
    offset :: structTy -> valTy -> name -> Int64

I have a few helper functions that use the offset method for reading/writing structure elements, for reading structures from stored references, and for lazily deferring structure reads (to allow lazy reading of the entire file).
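
A rough sketch of elemRead (simplified, building on the Env sketch above):

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad.Reader (asks)
import Control.Monad.IO.Class (liftIO)
import Foreign.Storable (Storable, peekByteOff)

-- Read one member by adding its offset to the structure's
-- base address and peeking at the resulting location.
elemRead :: forall st v name. (StructMem st v name, Storable v)
         => Addr st -> name -> DataOp v
elemRead (Addr base) n = do
    p <- asks envBase
    liftIO $ peekByteOff p
        (fromIntegral base + fromIntegral (offset (undefined :: st) (undefined :: v) n))

elemMaybePtr is similar, except that it treats a zero word as Nothing and defers dereferencing the stored Addr (e.g. with unsafeInterleaveIO), which is what allows the lazy whole-file reads.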

The problem with this is that it involves a lot of repetition to use. For one structure, I first have to define the Haskell type:

data RowBlock = RowBlock {rbNext :: Maybe RowBlock
                         ,rbPrev :: Maybe RowBlock
                         ,rbRows :: [RowTy]
                         }

Then the name types:

data Next = Next
data Prev = Prev
data Count = Count
newtype Row = Row Int64

Then the instances for each structure member:

instance StructMem RowBlock (Maybe (Addr RowBlock)) Next where offset _ _ _ = 0
instance StructMem RowBlock (Maybe (Addr RowBlock)) Prev where offset _ _ _ = 8
instance StructMem RowBlock Int64 Count where offset _ _ _ = 16
instance StructMem RowBlock RowTy Row where offset _ _ (Row n) = 24 + n * 8

Then the structure read method:

instance DBStruct RowBlock where
    structRead a = do
        n <- elemMaybePtr a Next
        p <- elemMaybePtr a Prev
        c <- elemRead a Count
        rs <- mapM (elemRead a . Row) [0 .. c-1]
        return $ RowBlock n p rs

So all I've really accomplished is to re-implement C structs in a much more verbose (and slow) way. I would be happier if this were more concise while preserving type safety. Surely this is a commonly encountered problem.

A few possible alternatives I can think of are:

  • Ditch memory-mapped files and use Data.Binary, writing the ByteStrings to disk the normal way.
  • Use deriving Generic to create generic read and write functions (a sketch follows this list)
  • Overload Functional References
  • Do something magical with monadic lenses.
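
For the Generic option, the deriving would look roughly like this (assuming RowTy also gets a Binary instance; the Maybe and list instances ship with the library):

{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary)
import GHC.Generics (Generic)

data RowBlock = RowBlock {rbNext :: Maybe RowBlock
                         ,rbPrev :: Maybe RowBlock
                         ,rbRows :: [RowTy]
                         } deriving (Generic)

instance Binary RowBlock  -- encode/decode derived generically

-- Caveat: if blocks genuinely link back to each other, generic
-- encoding would recurse forever; back-references would have to be
-- encoded as offsets rather than as values.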

EDIT: SSCCE as requested

asked Nov 13 '22 by Dan

1 Answer

You might try using Data.Binary with your Ptrs.

For writing:

Use Data.Binary to build a ByteString. A ByteString is essentially a triple (ForeignPtr Word8, Int, Int) holding the storage, the offset into it, and the length of the data. The Data.ByteString.Internal module exports toForeignPtr, which unpacks that triple for you. Foreign.ForeignPtr provides withForeignPtr, which runs an IO action with the raw pointer. In there you can memcpy (a binding is also exported by Data.ByteString.Internal) the ByteString's storage to the mmapped Ptr you got from mmap.
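
A minimal sketch of the write path (writeToMapping is just an illustrative name; dest is assumed to already point at the right offset inside the mapping):

import qualified Data.ByteString as BS
import qualified Data.ByteString.Internal as BSI
import Foreign.ForeignPtr (withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Data.Word (Word8)

-- Copy an encoded ByteString into an mmapped region.
writeToMapping :: Ptr Word8 -> BS.ByteString -> IO ()
writeToMapping dest bs =
    let (fp, off, len) = BSI.toForeignPtr bs
    in withForeignPtr fp $ \src ->
           BSI.memcpy dest (src `plusPtr` off) len

Note that Data.Binary.encode produces a lazy ByteString, so you'd flatten it first, e.g. writeToMapping destPtr (Data.ByteString.Lazy.toStrict (Data.Binary.encode rowBlock)).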

For reading:

You can use Data.ByteString.Internal's fromForeignPtr to turn a pointer into a ByteString (wrapping the raw mmapped Ptr with newForeignPtr_ first, since fromForeignPtr wants a ForeignPtr). This is basically what the mmap libraries do, but you can do it a record at a time instead of for the entire region. Once you have a ByteString view of the memory, you can unpack it with Data.Binary.
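
A matching sketch for the read side (newForeignPtr_ attaches no finalizer, so the decoded value must not outlive the mapping):

import qualified Data.ByteString.Internal as BSI
import qualified Data.ByteString.Lazy as BL
import Data.Binary (Binary, decode)
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (Ptr)
import Data.Word (Word8)

-- View 'len' bytes at an mmapped address as a ByteString and decode it.
readFromMapping :: Binary a => Ptr Word8 -> Int -> IO a
readFromMapping p len = do
    fp <- newForeignPtr_ p
    let bs = BSI.fromForeignPtr fp 0 len
    return (decode (BL.fromStrict bs))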

Another option is to take advantage of the fact that ByteString has an alternative implementation in Data.Vector.Storable.ByteString (from the vector-bytestring package), which would let you use the Storable interface you're using now to read/write them to the mmapped Ptrs. The interface and basic type are isomorphic to the Data.ByteString ones, but with Storable instances.

answered Nov 22 '22 by Levi Pearson