Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Converting 64-bit Double to ByteString efficiently

I wrote a function to convert 64-bit Double to ByteString (architecture/type safety is not really an issue - let us assume for now that the Double is 64-bit Word). While the function below works well, I am wondering if there is a faster way to convert the Double to ByteString. In the code below, there is one unpack of Word64 into Word8 list, followed by reverse (to make it little endian format), and then packing into ByteString. The code is below:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)

encodeDouble :: Double -> ByteString
encodeDouble (D# x) = pack $ reverse $ unpack64 $ W64# (unsafeCoerce# x)

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

-- function to convert list of bytestring into hex digits - for debugging
bprint :: ByteString -> String
bprint x = ("0x" ++ ) $ foldl (++) "" $ fmap (printf "%02x") $ unpack x

main = putStrLn $ bprint $ encodeDouble 7234.4

A sample GHCi output on Mac x86:

*Main> bprint $ encodeDouble 7234.4
"0x666666666642bc40"

While the code seems to work well, I plan to use it for encoding lot of Double values into ByteString before sending it over IPC. So, I will appreciate pointers on making it faster, if there are any.

It does seem to me that double must be unpacked into Word8, and then packed into ByteString. So, may be the overall algorithm as it is, can't be improved upon much. But, using a more efficient unpack/pack function will probably make a difference, if there were one.

EDIT1: I just discovered another complication on Mac (GHC 7.0.3) - the code above won't compile in GHC because of this error - I was testing in GHCi so far:

$ ghc -O --make t.hs
[1 of 1] Compiling Main             ( t.hs, t.o )

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:285:0:
    suffix or operands invalid for `movsd'

/var/folders/_q/33htc59519b3xq7y6xv100z40000gp/T/ghc6976_0/ghc6976_0.s:304:0:
    suffix or operands invalid for `movsd'

So, it looks like I have to fall back on FFI (cereal/data-binary-ieee754 package) until this bug is fixed, or until I find a workaround. Looks like related to GHC Ticket 4092. Please correct me if this is a new bug, or a different bug. For now, I can't compile it :(

EDIT2: Updating the code to use unsafeCoerce fixes the compilation issue. Code below with Criterion benchmark:

{-# LANGUAGE MagicHash #-}
import GHC.Prim
import GHC.Types
import GHC.Word
import Data.Bits (shiftR)
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (ByteString)
import Text.Printf (printf)
import Unsafe.Coerce
import Criterion.Main

--encodeDouble :: Double -> ByteString
encodeDouble  x = pack $ reverse $ unpack64 $ unsafeCoerce x

unpack64 :: Word64 -> [Word8]
unpack64 x = map (fromIntegral.(shiftR x)) [56,48..0]

main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

Criterion Output (truncated):

estimating cost of a clock call...
mean is 46.09080 ns (36 iterations)

benchmarking encodeDouble/78901.234
mean: 218.8732 ns, lb 218.4946 ns, ub 219.3389 ns, ci 0.950
std dev: 2.134809 ns, lb 1.757455 ns, ub 2.568828 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 219.5382 ns, lb 219.0744 ns, ub 220.1296 ns, ci 0.950
std dev: 2.675674 ns, lb 2.197591 ns, ub 3.451464 ns, ci 0.950

On further analysis, most of the bottleneck seems to be in unpack64. Coercion takes ~6ns. unpack64 takes ~195ns. Unpacking the word64 as a list of word8 is quite expensive here.

like image 684
Sal Avatar asked Dec 02 '11 01:12

Sal


3 Answers

I recently added support for IEEE-754 floats to cereal, and you can find similar functions for binary in data-binary-ieee754. Here's an example using the cereal version to roundtrip pi to a ByteStringand back:

Prelude Data.Serialize> runGet getFloat64be $ runPut $ putFloat64be pi
Right 3.141592653589793

It uses a trick with ST arrays to do the conversion quickly; see this earlier question for more details.

Update: D'oh, I should know how to use calls I contributed to the library...

Update x2: Regarding the compile failure, I don't think this qualifies as a bug.

I haven't looked too carefully at the generated assembly for this particular code, but the operands to a movsd instruction are getting fouled up. From §11.4.1.1 of the Intel x86 manual:

The MOVSD (move scalar double-precision floating-point) transfers a 64-bit double-precision floating-point operand from memory to the low quadword of an XMM register or vice versa, or between XMM registers.

In the unoptimized code, you have fine instructions like movsd LnTH(%rip),%xmm0, but in the -O code, you see things like movsd Ln2cJ(%rip),%rax, where %rax is a general-purpose register, rather than an XMM register.

The optimizer is likely making assumptions about the representations of data it needs to move between registers based on the type of data involved. unsafeCoerce and friends invalidate those assumptions, so when the instruction selector thinks it's choosing the right operation for a D#, it's actually emitting code that tries to stuff that D# where a W64# would happily fit.

Since handling this would require the optimizer to abandon many of the assumptions that let it emit better code under normal circumstances, I'm inclined to say this is not a bug but rather a good story for why unsafe functions bear a caveat emptor warning.

like image 93
acfoltzer Avatar answered Oct 13 '22 08:10

acfoltzer


Note that the use of unsafeCoerce# is dangerous here, the docs say

Casting an unboxed type to another unboxed type of the same size (but not coercions between floating-point and integral types)

Concerning the speed, it may be faster to avoid the intermediate list and directly write to the memory via unsafeCreate from Data.ByteString.Internal.

like image 24
Daniel Fischer Avatar answered Oct 13 '22 07:10

Daniel Fischer


Following the suggestion of acfoltzer (cereal source code), and Daniel Fischer (unsafeCreate), I wrote the code below that works well for my use case, and is fast too:

{-#LANGUAGE MagicHash #-}
import Data.ByteString (pack, unpack)
import Data.ByteString.Internal (unsafeCreate,ByteString)
import Data.Bits (shiftR)
import GHC.Int (Int64)
import GHC.Prim
import GHC.Types
import GHC.Word
import Unsafe.Coerce
import Criterion.Main
import Foreign

-- | Write a Word64 in little endian format
putWord64le :: Word64 -> Ptr Word8 -> IO()
putWord64le w p = do
  poke p               (fromIntegral (w)           :: Word8)
  poke (p `plusPtr` 1) (fromIntegral (shiftR w  8) :: Word8)
  poke (p `plusPtr` 2) (fromIntegral (shiftR w 16) :: Word8)
  poke (p `plusPtr` 3) (fromIntegral (shiftR w 24) :: Word8)
  poke (p `plusPtr` 4) (fromIntegral (shiftR w 32) :: Word8)
  poke (p `plusPtr` 5) (fromIntegral (shiftR w 40) :: Word8)
  poke (p `plusPtr` 6) (fromIntegral (shiftR w 48) :: Word8)
  poke (p `plusPtr` 7) (fromIntegral (shiftR w 56) :: Word8)

{-# INLINE putWord64le #-}

encodeDouble :: Double -> ByteString
encodeDouble x = unsafeCreate 8 (putWord64le $ unsafeCoerce x)

main :: IO ()
main = defaultMain [
        bgroup "encodeDouble" [
          bench "78901.234"  $ whnf encodeDouble 78901.234
          , bench "789.01" $ whnf encodeDouble 789.01
          ]
       ]

Criterion output (truncated):

estimating cost of a clock call...
mean is 46.80361 ns (35 iterations)
found 5 outliers among 35 samples (14.3%)
  3 (8.6%) high mild
  2 (5.7%) high severe

benchmarking encodeDouble/78901.234
mean: 18.80689 ns, lb 18.73805 ns, ub 18.97247 ns, ci 0.950
std dev: 516.7499 ps, lb 244.8588 ps, ub 1.043685 ns, ci 0.950

benchmarking encodeDouble/789.01
mean: 18.96963 ns, lb 18.90986 ns, ub 19.06127 ns, ci 0.950
std dev: 374.2191 ps, lb 275.3313 ps, ub 614.4281 ps, ci 0.950

From ~220ns down to ~19ns, nice! I didn't do anything fancy in compilation. Just -O flag will do in GHC7 (Mac, x86_64).

Now, trying to figure out how to do it fast with list of doubles!

like image 1
Sal Avatar answered Oct 13 '22 07:10

Sal