Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

IOUArray to ByteSring, as quickly as possible

I need to mutate elements in a fixed size array of Word8 very quickly. For this purpose I am using an IOUArray. I need to send this array over a websocket connection. The function sendBinaryData from the websockets package requires a ByteString. I need to convert from one representation to the other. I am using this function currently:

arrayToBS :: IOUArray Int Word8 -> IO (BS.ByteString)
arrayToBS = (fmap BS.pack) . getElems

This function converts the elements of an array to a [Word8] before packing that list into a bytestring, and from profiling I can see that it is quite slow. I was wondering if there a way to speed up this function, or possibly send the array over a websocket connection directly?

The array that I am using currently is:

size = 1000;
numBytes = size * size * 4

newBuffer :: IO (IOUArray Int Word8)
newBuffer = newArray (0, numBytes) 200 :: IO (IOUArray Int Word8)

and an except from the performance report:

COST CENTRE MODULE SRC                        %time %alloc

arrayToBS   Lib    src/Lib.hs:28:1-37          88.1   99.0
newBuffer   Lib    src/Lib.hs:(23,1)-(25,12)    9.9    0.8

Ideally arrayToBS would be much faster than creating the array. If I change the size to 100:

COST CENTRE         MODULE                          SRC                                                %time %alloc

arrayToBS           Lib                             src/Lib.hs:21:1-37                           100.0   86.1
mkEncodeTable.table Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:105:5-75    0.0    8.0
mkEncodeTable.ix    Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:104:5-43    0.0    1.1
like image 369
felixgb Avatar asked Oct 18 '22 07:10

felixgb


2 Answers

Disclaimer: I'm not very familiar with these low level primitives so this might be unsafe in some cases.


You will at the very least need to copy the data over once since, as @user2407038 remarks, the underlying data stored in an IOUArray is an unpinned array, so we can't count on GHC not moving the array around. The reverse direction (ByteString to IOArray) is however possible without a copy.

{-# LANGUAGE UnboxedTuples, MagicHash #-}

import Data.ByteString.Internal (ByteString(..))
import Data.Array.IO.Internals  (IOUArray(..))
import Data.Array.Base          (STUArray(..))
import Data.Word                (Word8)

import Foreign.ForeignPtr (mallocForeignPtrBytes, withForeignPtr)
import GHC.IO             (IO(..))
import GHC.Exts           (copyMutableByteArrayToAddr#, Ptr(..), Int(..))

arrayToBS :: IOUArray Int Word8 -> IO ByteString
arrayToBS (IOUArray (STUArray _ _ n@(I# n') mutByteArr)) = do
  bytes <- mallocForeignPtrBytes n
  withForeignPtr bytes $ \(Ptr addr) -> IO $ \state ->
    (# copyMutableByteArrayToAddr# mutByteArr 0# addr n' state, () #)
  pure (PS bytes 0 n)

Here is a test of this working (remember that the ascii code for 'A' is 65):

ghci> iou <- newListArray (-2,9) [65,67..] :: IO (IOUArray Int Word8)
ghci> arrayToBS iou
"ACEGIKMOQSUW"
like image 169
Alec Avatar answered Nov 15 '22 05:11

Alec


Ok, thanks to user2407038 I have something (Note that I have never played with primitives or unboxed types before):

import Control.Monad.ST
import qualified Data.ByteString as BS
import Data.Word
import Data.Array.ST
import Data.Array.Base
import Data.ByteString.Internal
import GHC.Prim
import GHC.Exts
import GHC.ForeignPtr

bs2Addr# :: BS.ByteString -> Addr#
bs2Addr# (PS fptr offset len) = case fptr of
  (ForeignPtr addr _ ) -> addr

arrayPrim (STUArray _ _ _ x) = x

unbox :: Int -> Int#
unbox (I# n#) = n#

copy :: Int -> IO BS.ByteString
copy len = do
  -- Get the length as unboxed
  let len# = unbox len

  -- Bytestring to copy to, filled with 0s initially
  let bs = BS.pack (replicate len 0)

  -- Create a new STUArray. I don't know why it needs to be length * 2.
  arr <- stToIO (newArray (0, len * 2) 255 :: ST s (STUArray s Int Word8))

  -- MutableByteArray#
  let mArrPrim = arrayPrim arr

  -- Addr#
  let addr = bs2Addr# bs

  -- I don't know what the 2nd and 4th Int# arguments are suppose to be.
  let _ = copyMutableByteArrayToAddr# mArrPrim len# addr len# realWorld#
  return bs

I am using STUArray here instead of IOUArray now because I couldn't find the IOUArray constructor.

Results of profiling this code with 4000000 element array:

    Sun Aug 20 20:49 2017 Time and Allocation Profiling Report  (Final)

       shoot-exe +RTS -N -p -RTS

    total time  =        0.05 secs   (47 ticks @ 1000 us, 1 processor)
    total alloc = 204,067,640 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                        %time %alloc

copy.bs     Lib    src/Lib.hs:32:7-36          66.0   96.0
copy        Lib    src/Lib.hs:(27,1)-(45,11)   34.0    3.9

This is the code that I compared it against:

arrayToBS :: (STUArray s Int Word8) -> ST s (BS.ByteString)
arrayToBS = (fmap BS.pack) . getElems

slowCopy :: Int -> IO BS.ByteString
slowCopy len = do
  arr <- stToIO (newArray (0, len - 1) 255 :: ST s (STUArray s Int Word8))
  stToIO $ arrayToBS arr

And its profiling report:

    Sun Aug 20 20:48 2017 Time and Allocation Profiling Report  (Final)

       shoot-exe +RTS -N -p -RTS

    total time  =        0.55 secs   (548 ticks @ 1000 us, 1 processor)
    total alloc = 1,604,073,872 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                        %time %alloc

arrayToBS   Lib    src/Lib.hs:48:1-37          98.2   99.7
slowCopy    Lib    src/Lib.hs:(51,1)-(53,24)    1.6    0.2

OK, new version is faster. They both produce the same output. However, I would still like to know what the #Int parameters to copyMutableByteArrayToAddr# and why I have to multiply the length of the array in the fast version by 2. I'll play around a bit more and update this answer if I find out.

Update: Alec's answer

For those curious, this is the result of profiling Alec's answer:

    Sun Aug 20 21:13 2017 Time and Allocation Profiling Report  (Final)

       shoot-exe +RTS -N -p -RTS

    total time  =        0.01 secs   (7 ticks @ 1000 us, 1 processor)
    total alloc =   8,067,696 bytes  (excludes profiling overheads)

COST CENTRE   MODULE SRC                          %time %alloc

newBuffer     Other  src/Other.hs:23:1-33          85.7   49.6
arrayToBS.\.\ Other  src/Other.hs:19:5-69          14.3    0.0
arrayToBS     Other  src/Other.hs:(16,1)-(20,21)    0.0   49.6

Looks like that's the one to use.

like image 38
felixgb Avatar answered Nov 15 '22 05:11

felixgb