For performance reasons I would like a zero-copy cast of ByteString
(strict, for now) to a Vector
. Since Vector
is just a ByteArray#
under the hood, and ByteString
is a ForeignPtr
this might look something like:
caseBStoVector :: ByteString -> Vector a
caseBStoVector (BS fptr off len) =
withForeignPtr fptr $ \ptr -> do
let ptr' = plusPtr ptr off
p = alignPtr ptr' (alignment (undefined :: a))
barr = ptrToByteArray# p len -- I want this function, or something similar
barr' = ByteArray barr
alignI = minusPtr p ptr
size = (len-alignI) `div` sizeOf (undefined :: a)
return (Vector 0 size barr')
That certainly isn't right. Even with the missing function ptrToByteArray#
this seems to need to escape the ptr
outside of the withForeignPtr
scope. So my quesetions are:
This post probably advertises my primitive understanding of ByteArray#
, if anyone can talk a bit about ByteArray#
, it's representation, how it is managed (GCed), etc I'd be grateful.
The fact that ByteArray#
lives on the GCed heap and ForeignPtr
is external seems to be a fundamental issue - all the access operations are different. Perhaps I should look at redefining Vector
from = ByteArray !Int !Int
to something with another indirection? Someing like = Location !Int !Int
where data Location = LocBA ByteArray | LocFPtr ForeignPtr
and provide wrapping operations for both those types? This indirection might hurt performance too much though.
Failing to marry these two together, maybe I can just access arbitrary element types in a ForeignPtr
in a more efficient manner. Does anyone know of a library that treats ForeignPtr
(or ByteString
) as an array of arbitrary Storable
or Primitive
types? This would still lose me the stream fusion and tuning from the Vector package.
Disclaimer: everything here is an implementation detail and specific to GHC and the internal representations of the libraries in question at the time of posting.
This response is a couple years after the fact, but it is indeed possible to get a pointer to bytearray contents. It's problematic as the GC likes to move data in the heap around, and things outside of the GC heap can leak, which isn't necessarily ideal. GHC solves this with:
newPinnedByteArray# :: Int# -> State# s -> (#State# s, MutableByteArray# s#)
Primitive bytearrays (internally typedef'd C char arrays) can be statically pinned to an address. The GC guarantees not to move them. You can convert a bytearray reference to a pointer with this function:
byteArrayContents# :: ByteArray# -> Addr#
The address type forms the basis of Ptr and ForeignPtr types. Ptrs are addresses marked with a phantom type and ForeignPtrs are that plus optional references to GHC memory and IORef finalizers.
Disclaimer: This will only work if your ByteString was built Haskell. Otherwise, you can't get a reference to the bytearray. You cannot dereference an arbitrary addr. Don't try to cast or coerce your way to a bytearray; that way lies segfaults. Example:
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.IO
import GHC.Prim
import GHC.Types
main :: IO()
main = test
test :: IO () -- Create the test array.
test = IO $ \s0 -> case newPinnedByteArray# 8# s0 of {(# s1, mbarr# #) ->
-- Write something and read it back as baseline.
case writeInt64Array# mbarr# 0# 1# s1 of {s2 ->
case readInt64Array# mbarr# 0# s2 of {(# s3, x# #) ->
-- Print it. Should match what was written.
case unIO (print (I# x#)) s3 of {(# s4, _ #) ->
-- Convert bytearray to pointer.
case byteArrayContents# (unsafeCoerce# mbarr#) of {addr# ->
-- Dereference the pointer.
case readInt64OffAddr# addr# 0# s4 of {(# s5, x'# #) ->
-- Print what's read. Should match the above.
case unIO (print (I# x'#)) s5 of {(# s6, _ #) ->
-- Coerce the pointer into an array and try to read.
case readInt64Array# (unsafeCoerce# addr#) 0# s6 of {(# s7, y# #) ->
-- Haskell is not C. Arrays are not pointers.
-- This won't match. It might segfault. At best, it's garbage.
case unIO (print (I# y#)) s7 of (# s8, _ #) -> (# s8, () #)}}}}}}}}
Output:
1
1
(some garbage value)
To get the bytearray from a ByteString, you need to import the constructor from Data.ByteString.Internal and pattern match.
data ByteString = PS !(ForeignPtr Word8) !Int !Int
(\(PS foreignPointer offset length) -> foreignPointer)
Now we need to rip the goods out of the ForeignPtr. This part is entirely implementation-specific. For GHC, import from GHC.ForeignPtr.
data ForeignPtr a = ForeignPtr Addr# ForeignPtrContents
(\(ForeignPtr addr# foreignPointerContents) -> foreignPointerContents)
data ForeignPtrContents = PlainForeignPtr !(IORef (Finalizers, [IO ()]))
| MallocPtr (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()]))
| PlainPtr (MutableByteArray# RealWorld)
In GHC, ByteString is built with PlainPtrs which are wrapped around pinned byte arrays. They carry no finalizers. They are GC'd like regular Haskell data when they fall out of scope. Addrs don't count, though. GHC assumes they point to things outside of the GC heap. If the bytearray itself falls out of the scope, you're left with a dangling pointer.
data PlainPtr = (MutableByteArray# RealWorld)
(\(PlainPtr mutableByteArray#) -> mutableByteArray#)
MutableByteArrays are identical to ByteArrays. If you want true zero-copy construction, make sure you either unsafeCoerce# or unsafeFreeze# to a bytearray. Otherwise, GHC creates a duplicate.
mbarrTobarr :: MutableByteArray# s -> ByteArray#
mbarrTobarr = unsafeCoerce#
And now you have the raw contents of the ByteString ready to be turned into a vector.
Best Wishes,
You might be able to hack together something :: ForeignPtr -> Maybe ByteArray#
, but there is nothing you can do in general.
You should look at the Data.Vector.Storable
module. It includes a function unsafeFromForeignPtr :: ForeignPtr a -> Int -> Int -> Vector a
. It sounds like what you want.
There is also a Data.Vector.Storable.Mutable
variant.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With