Optimization suggestions when writing storable vector definition for union struct

Question

I wrote a storable vector instance for the data type below (original question here):

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16

The code for defining those instances for Storable vector is below. While I am getting very good performance with the code below, I am very much interested in generic suggestions to improve the performance of that storable instance. By generic suggestion, I mean the following:

It is not specific to a GHC compiler version. You can assume GHC 6.12.3+ to exclude performance bugs if any present in earlier versions, and relevant to the code here.
Platform-specific suggestions are ok. You may assume x86_64 Linux platform.
A generic suggestion more in the form of algorithm improvement (big O) is very much valued, than a suggestion that exploits hardware-specific optimizations. But, given a basic operation like peek/poke here, there is not much scope for algorithmic improvement, as far as I can tell (and hence more valuable because it is a scarce commodity :)
Compiler flags for x86_64 are acceptable (e.g., telling compiler about removing floating point safe check etc.). I am using "-O2 --make" option to compile the code.

If there is any known good library source code that does similar thing (i.e., define storable instances for union/recursive data types), I will be very much interested in checking them.

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
                deriving (Show)

instance Storable Atoms where
  sizeOf _ = 1 + sizeOf (undefined :: Int32)
  alignment _ = 1 + alignment (undefined :: Int32)

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word8) `plusPtr` 1 -- get pointer to start of the    element. First byte is type of element
            t <- peek (castPtr p::Ptr Word8)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              1 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word8) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word8) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the     element. First byte is type of element

Update:

Based on feedback from Daniel and dflemstr, I rewrote the alignment, and also, updated the constructor to be of type Word32 instead of Word8. But, it seems that for this to be effective, the data constructor too should be updated to have unpacked values - that was an oversight on my part. I should have written data constructor to have unpacked values in the first place (see performance slides by John Tibbell - slide #49). So, rewriting the data constructor, coupled with alignment and constructor changes, made a big impact on the performance, improving it by about 33% for functions over vector (a simple sum function in my benchmark test). Relevant changes below (warning - not portable but it is not an issue for my use case):

Data constructor change:

data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16

Storable sizeof and alignment changes:

instance Storable Atoms where
  sizeOf _ = 2*sizeOf (undefined :: Int32)
  alignment _ = 4

  {-# INLINE peek #-}
  peek p = do
            let p1 = (castPtr p::Ptr Word32) `plusPtr` 1
            t <- peek (castPtr p::Ptr Word32)
            case t of
              0 -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
                    return (I x)
              _ -> do
                    x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
                    return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
              poke (castPtr p :: Ptr Word32) 0
              pokeElemOff (castPtr p1) 0 a
      S a -> do
              poke (castPtr p :: Ptr Word32) 1
              pokeElemOff (castPtr p1) 0 a
      where  p1 = (castPtr p :: Ptr Word32) `plusPtr` 1

Daniel Fischer · Accepted Answer

Four or eight byte aligned memory access is typically much faster than oddly aligned access. It may be that the alignment for your instance is automatically rounded up to eight bytes, but I'd advise to at least measure with explicit eight byte alignment, using 32 bits (Int32 or Word32) for the constructor tag and reading and writing both types of payloads as Int32. That'll waste bits, but there's a good chance it'll be faster. Since you're on a 64-bit platform, it may be even faster to use 16-byte alignment and reading/writing Int64. Benchmark, benchmark, benchmark to find out what serves you best.

Optimization suggestions when writing storable vector definition for union struct

Tags:

optimization

haskell

vector

storable

Sal

1 Answers

Daniel Fischer

Recent Activity

Donate For Us

Optimization suggestions when writing storable vector definition for union struct

Tags:

optimization

haskell

vector

storable

Sal

1 Answers

Daniel Fischer

Related questions

Recent Activity

Donate For Us