I have a data structure, such as an expression tree or graph, and I want to add some "measure" functions to it, such as depth and size. What is the best way to type these functions? I see the following three variants as being of roughly equal usefulness:
depth :: Expr -> Int
depth :: Expr -> Integer
depth :: Num a => Expr -> a
I have the following considerations in mind:
I'm looking at base and fgl as examples, and they consistently use Int, but Data.List also has functions such as genericLength that are polymorphic in their return type, and I am thinking that maybe the addition of these generic functions reflects a modernizing trend that I probably should respect and reinforce.
A similar movement of thought is noticeable in some widely used libraries, which provide a comprehensive set of functions with the same functionality whenever there are several probable choices of return type a user might desire (e.g. xml-conduit offers parsers that accept both lazy and strict variants of either ByteString or Text).
Integer is a nicer type overall than Int, and I sometimes find that I need to cast the length of a list to an Integer, say because an algorithm that operates in Integer needs to take this length into account.
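For example, a cast like the following comes up regularly (a minimal illustration):

-- length returns Int, so a conversion is needed before the result
-- can participate in an Integer-based computation.
lengthAsInteger :: [a] -> Integer
lengthAsInteger xs = toInteger (length xs)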
Making these functions return any Integral means they are polymorphic, and that may carry a performance penalty. I don't know all the particulars well, but, as I understand it, there may be some run-time cost, and polymorphic things are harder to memoize.
What is the accepted best practice? Which part of it is due to legacy and compatibility considerations? (That is, if Data.List were designed today, what type would functions such as length have?) Did I miss any pros and cons?
Short answer: as a general rule, use Int, and if you need to convert it to something else, use fromIntegral. (If you find yourself doing the conversion a lot, define fi = fromIntegral to save typing, or else create your own wrapper.)
The main consideration is performance. You want to write the algorithm so that it uses an efficient integer type internally. Provided Int is big enough for whatever calculation you're doing (the standard guarantees only a signed 30-bit integer, but even on 32-bit platforms using GHC, it's a signed 32-bit integer), you can assume it will be a high-speed integer type on the platform, particularly in comparison to Integer (which has boxing and bignum calculation overhead that can't be optimized away). Note that the performance differences can be substantial: simple counting algorithms will often be 5-10x faster using Ints compared to Integers.
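To illustrate the kind of code where the difference shows up (a minimal sketch, not a rigorous benchmark), here is the same counting loop instantiated at both types:

{-# LANGUAGE BangPatterns #-}

-- Identical loops; only the accumulator type differs. With Int, GHC can
-- unbox the accumulator into a machine register; with Integer it cannot.
countInt :: (a -> Bool) -> [a] -> Int
countInt p = go 0
  where
    go !n []     = n
    go !n (x:xs) = go (if p x then n + 1 else n) xs

countInteger :: (a -> Bool) -> [a] -> Integer
countInteger p = go 0
  where
    go !n []     = n
    go !n (x:xs) = go (if p x then n + 1 else n) xs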
You could give your function a different signature:
depth :: Expr -> Integer
depth :: (Num a) => Expr -> a
and still implement it under the hood using the efficient Int type, doing the conversion at the end. However, making the conversion implicit strikes me as poor practice. Particularly if this is a library function, making it clear that Int is being used internally by making it part of the signature strikes me as more sensible.
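For concreteness, a minimal sketch of that implicit-conversion pattern, assuming a toy Expr type (the constructors here are hypothetical):

data Expr = Leaf | Node Expr Expr

-- The worker uses Int internally for speed ...
depthInt :: Expr -> Int
depthInt Leaf       = 0
depthInt (Node l r) = 1 + max (depthInt l) (depthInt r)

-- ... and the exported function hides the conversion in its signature.
depth :: Num a => Expr -> a
depth = fromIntegral . depthInt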
With respect to your listed considerations:
First, the generic* functions in Data.List aren't modern. In particular, genericLength was available in GHC 0.29, released in July 1996. At some prior point, length had been defined in terms of genericLength, as simply:
length :: [a] -> Int
length = genericLength
but in GHC 0.29, this definition was commented out with an #ifdef USE_REPORT_PRELUDE, and several hand-optimized variants of length were defined independently. The other generic* functions weren't in 0.29, but they were already around by GHC 4.02 (1998).
Most importantly, when the Prelude version of length was generalized from lists to Foldables, which is a fairly recent development (since GHC 7.10?), nobody cared enough to do anything with genericLength. I also don't think I've ever seen these functions used "in the wild" in any serious Haskell code. For the most part, you can think of them as deprecated.
Second, the use of lazy/strict and ByteString/Text variants in libraries represents a somewhat different situation. In particular, an xml-conduit user will normally be making the decision between lazy and strict variants, and between ByteString and Text types, based on considerations about the data being processed and the construction of the algorithms that are far-reaching and pervade the entire type system of a given program. If the only way to use xml-conduit with a lazy Text type were to convert it piecemeal to strict ByteStrings, pass it to the library, and then pull it back out and convert it back to a lazy Text type, no one would accept that complexity. In contrast, a monomorphic Int-based definition of depth works fine, because all you need is fromIntegral . depth to adapt it to any numeric context.
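In code, using the same hypothetical Expr type as in the sketch above, the monomorphic approach plus an explicit conversion at the call site might look like:

data Expr = Leaf | Node Expr Expr

depth :: Expr -> Int
depth Leaf       = 0
depth (Node l r) = 1 + max (depth l) (depth r)

-- A caller that needs an Integer (or any other Num) just composes:
depthInteger :: Expr -> Integer
depthInteger = fromIntegral . depth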
Third, as noted above, Integer is only a "nicer" type from the standpoint of having arbitrary precision in situations where you don't care about performance. For things like depth and count in any practical setting, performance is likely to be more important than unlimited precision.
Fourth, I don't think either the runtime cost or the failure-to-memoize should be a serious consideration in choosing between polymorphic and non-polymorphic versions here. In most situations, GHC will generate a specialized version of the polymorphic function in a context where memoization is no problem.
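(If you do choose the polymorphic signature and want to guarantee specialization, GHC also lets you request it explicitly in the defining module; a minimal sketch, assuming the polymorphic depth from the example above:

-- Ask GHC to generate a monomorphic copy of depth at Int.
{-# SPECIALIZE depth :: Expr -> Int #-}
)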
On this basis, I suspect that if Data.List were designed today, it would still use Ints.
I agree with all the points in K. A. Buhr's great answer, but here are a couple more:
You should use a return type of Integer if you expect to support an expression tree that somehow doesn't fit in memory (which seems interesting, but unlikely). If I saw Expr -> Integer, I would go looking in the code or docs to try to understand how or why the codomain might be so large.
Re. performance of Integer: normal machine-word arithmetic will be used if the number is not larger than the maximum width of a machine word. Simplifying, the type is basically:
data Integer = SmallInteger Int | LargeInteger ByteArray
K. A. Buhr mentions that there is an unavoidable performance penalty, namely that this value cannot be unboxed (that is, it will always have a heap representation and will be read from and written to memory), and that sounds right to me.
In contrast, functions on Int (or Word) are often unboxed, so that in core you will see types that look like Int# -> Int# -> Int#. You can think of an Int# as existing only in a machine register. This is what you want your numeric code to look like if you care about performance.
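For instance (a minimal sketch), a strict counting loop like the following typically compiles to a worker whose type mentions Int#, which you can confirm by compiling with ghc -O2 -ddump-simpl:

{-# LANGUAGE BangPatterns #-}

-- Sum of 1..n. With optimization, the recursive worker operates on
-- raw, register-resident Int# values rather than boxed Ints.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)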
Re. polymorphic versions: designing libraries around concrete numeric inputs and polymorphic numeric outputs probably works okay in terms of convenient type inference. We already have this to a certain degree, in that numeric literals are overloaded. There are certainly times when literals (e.g. also string literals with -XOverloadedStrings) need to be given type signatures, and so I'd expect that if base were designed to be more polymorphic, you would run into more occasions where this would be required (but fewer uses of fromIntegral).
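As a small illustration with the existing genericLength, whose result type is polymorphic, the caller (or type defaulting) has to pick the concrete type, and an annotation is the usual way to do so:

import Data.List (genericLength)

-- The polymorphic result means the caller must choose the concrete
-- numeric type; here the signature fixes it to Int.
len :: Int
len = genericLength "hello"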
Another option you haven't mentioned is using Word to express that the depth is non-negative. This is more useful for inputs, but even then it is often not worth it: Word will still overflow, and negative literals are valid (although GHC will issue a warning); to a certain extent, it just moves where the bug occurs.
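To see the overflow point concretely (a minimal example following standard Word semantics):

-- Word arithmetic wraps modulo 2^n rather than failing, so the type
-- doesn't actually enforce non-negativity: going below zero wraps around.
underflow :: Word
underflow = 0 - 1   -- equals maxBound :: Word, e.g. 18446744073709551615 on 64-bit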