I wrote a storable vector instance for the data type below (original question here):
data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
The code for defining those instances for Storable vector is below. While I am getting very good performance with the code below, I am very much interested in generic suggestions to improve the performance of that storable instance. By generic suggestion, I mean the following:
If there is any known good library source code that does similar thing (i.e., define storable instances for union/recursive data types), I will be very much interested in checking them.
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int
data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
deriving (Show)
instance Storable Atoms where
sizeOf _ = 1 + sizeOf (undefined :: Int32)
alignment _ = 1 + alignment (undefined :: Int32)
{-# INLINE peek #-}
peek p = do
let p1 = (castPtr p::Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
t <- peek (castPtr p::Ptr Word8)
case t of
0 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
return (I x)
1 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
return (S x)
{-# INLINE poke #-}
poke p x = case x of
I a -> do
poke (castPtr p :: Ptr Word8) 0
pokeElemOff (castPtr p1) 0 a
S a -> do
poke (castPtr p :: Ptr Word8) 1
pokeElemOff (castPtr p1) 0 a
where p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
Update:
Based on feedback from Daniel and dflemstr, I rewrote the alignment, and also, updated the constructor to be of type Word32 instead of Word8. But, it seems that for this to be effective, the data constructor too should be updated to have unpacked values - that was an oversight on my part. I should have written data constructor to have unpacked values in the first place (see performance slides by John Tibbell - slide #49). So, rewriting the data constructor, coupled with alignment and constructor changes, made a big impact on the performance, improving it by about 33% for functions over vector (a simple sum function in my benchmark test). Relevant changes below (warning - not portable but it is not an issue for my use case):
Data constructor change:
data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16
Storable sizeof and alignment changes:
instance Storable Atoms where
sizeOf _ = 2*sizeOf (undefined :: Int32)
alignment _ = 4
{-# INLINE peek #-}
peek p = do
let p1 = (castPtr p::Ptr Word32) `plusPtr` 1
t <- peek (castPtr p::Ptr Word32)
case t of
0 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
return (I x)
_ -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
return (S x)
{-# INLINE poke #-}
poke p x = case x of
I a -> do
poke (castPtr p :: Ptr Word32) 0
pokeElemOff (castPtr p1) 0 a
S a -> do
poke (castPtr p :: Ptr Word32) 1
pokeElemOff (castPtr p1) 0 a
where p1 = (castPtr p :: Ptr Word32) `plusPtr` 1
Four or eight byte aligned memory access is typically much faster than oddly aligned access. It may be that the alignment for your instance is automatically rounded up to eight bytes, but I'd advise to at least measure with explicit eight byte alignment, using 32 bits (Int32
or Word32
) for the constructor tag and reading and writing both types of payloads as Int32
. That'll waste bits, but there's a good chance it'll be faster. Since you're on a 64-bit platform, it may be even faster to use 16-byte alignment and reading/writing Int64
. Benchmark, benchmark, benchmark to find out what serves you best.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With