I have some data which can be represented by an unsigned Integral
type and its biggest value requires 52 bits. AFAIK only Integer
, Int64
and Word64
satisfy these requirements.
All the information I could find out about those types was that Integer
is signed and has a floating unlimited bit-size, Int64
and Word64
are fixed and signed and unsigned respectively. What I coudn't find out was the information on the actual implementation of those types:
How many bits will a 52-bit value actually occupy if stored as an Integer
?
Am I correct that Int64
and Word64
allow you to store a 64-bit data and weigh exactly 64 bits for any value?
Are any of those types more performant or preferrable for any other reasons than size, e.g. native code implementations or direct processor instructions-related optimizations?
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
How many bits will a 52-bit value actually occupy if stored as an
Integer
?
This is implementation-dependent. With GHC, values that fit inside a machine word are stored directly in a constructor of Integer
, so if you're on a 64-bit machine, it should take the same amount of space as an Int. This corresponds to the S#
constructor of Integer
:
data Integer = S# Int#
| J# Int# ByteArray#
Larger values (i.e. those represented with J#
) are stored with GMP.
Am I correct that
Int64
andWord64
allow you to store a 64-bit data and weigh exactly 64 bits for any value?
Not quite — they're boxed. An Int64
is actually a pointer to either an unevaluated thunk or a one-word pointer to an info table plus a 64-bit integer value. (See the GHC commentary for more information.)
If you really want something that's guaranteed to be 64 bits, no exceptions, then you can use an unboxed type like Int64#
, but I would strongly recommend profiling first; unboxed values are quite painful to use. For instance, you can't use unboxed types as arguments to type constructors, so you can't have a list of Int64#
s. You also have to use operations specific to unboxed integers. And, of course, all of this is extremely GHC-specific.
If you're looking to store a lot of 52-bit integers, you might want to use vector or repa (built on vector, with fancy things like automatic parallelism); they store the values unboxed under the hood, but let you work with them in boxed form. (Of course, each individual value you take out will be boxed.)
Are any of those types more performant or preferrable for any other reasons than size, e.g. native code implementations or direct processor instructions-related optimizations?
Yes; using Integer
incurs a branch for every operation, since it has to distinguish the machine-word and bignum cases; and, of course, it has to handle overflow. Fixed-size integral types avoid this overhead.
And just in case: which one would you recommend for storing a 52-bit value in an application extremely sensitive in terms of performance?
If you're using a 64-bit machine: Int64
or, if you must, Int64#
.
If you're using a 32-bit machine: Probably Integer
, since on 32-bit Int64
is emulated with FFI calls to GHC functions that are probably not very highly optimised, but I'd try both and benchmark it. With Integer
, you'll get the best performance on small integers, and GMP is heavily-optimised, so it'll probably do better on the larger ones than you might think.
You could select between Int64
and Integer
at compile-time using the C preprocessor (enabled with {-# LANGUAGE CPP #-}
); I think it would be easy to get Cabal to control a #define
based on the word width of the target architecture. Beware, of course, that they are not the same; you will have to be careful to avoid "overflows" in the Integer
code, and e.g. Int64
is an instance of Bounded
but Integer
is not. It might be simplest to just target a single word width (and thus type) for performance and live with the slower performance on the other.
I would suggest creating your own Int52
type as a newtype
wrapper over Int64
, or a Word52
wrapper over Word64
— just pick whichever matches your data better, there should be no performance impact; if it's just arbitrary bits I'd go with Int64
, just because Int
is more common than Word
.
You can define all the instances to handle wrapping automatically (try :info Int64
in GHCi to find out which instances you'll want to define), and provide "unsafe" operations that just apply directly under the newtype
for performance-critical situations where you know there won't be any overflow.
Then, if you don't export the newtype
constructor, you can always swap the implementation of Int52
later, without changing any of the rest of your code. Don't worry about the overhead of a separate type — the runtime representation of a newtype
is completely identical to the underlying type; they only exist at compile-time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With