What type to give to a function that measures a data structure? Int, Integer, Integral?

Tags:

haskell

I have a data structure, such as an expression tree or graph. I want to add some "measure" functions, such as depth and size.

How best to type these functions?
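
For concreteness, here is a minimal sketch of the kind of type and function I mean (the constructors are just an illustration, not my real code):

data Expr
  = Lit Int
  | Neg Expr
  | Add Expr Expr

-- The question is what signature to give this:
depth (Lit _)   = 1
depth (Neg e)   = 1 + depth e
depth (Add l r) = 1 + max (depth l) (depth r)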

I see the following three variants as roughly equally useful:

  • depth :: Expr -> Int
  • depth :: Expr -> Integer
  • depth :: Num a => Expr -> a

I have the following considerations in mind:

  • I'm looking at base and fgl as examples, and they consistently use Int. But Data.List also has functions such as genericLength that are polymorphic in their return type, and I wonder whether these generic functions reflect a modernizing trend that I should respect and reinforce.

  • A similar trend is visible in some widely used libraries that offer the same functionality under several types when users are likely to want different ones (e.g. xml-conduit offers parsers that accept both lazy and strict variants of either ByteString or Text).

  • Integer is overall a nicer type than Int, and I sometimes find that I need to convert the length of a list to an Integer, say because an algorithm that operates in Integer needs to take this length into account.

  • Making these functions polymorphic in their return type (Num or Integral) may carry a performance penalty. I don't know the particulars well, but, as I understand it, there may be some run-time cost, and polymorphic things are harder to memoize.

What is the accepted best practice? Which part of it is due to legacy and compatibility considerations? (I.e. if Data.List was designed today, what type would functions such as length have?) Did I miss any pros and cons?

Asked Jan 03 '23 by Ignat Insarov

2 Answers

Short answer: As a general rule use Int, and if you need to convert it to something else, use fromIntegral. (If you find yourself doing the conversion a lot, define fi = fromIntegral to save typing or else create your own wrapper.)
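
For illustration, reusing the toy Expr and a monomorphic depth :: Expr -> Int from the question, the conversion at a use site is a one-liner (withinBudget is a made-up consumer):

-- Shorthand for the conversion, to save typing.
fi :: (Integral a, Num b) => a -> b
fi = fromIntegral

-- An algorithm that operates in Integer can still consume the Int result.
withinBudget :: Integer -> Expr -> Bool
withinBudget budget e = fi (depth e) <= budget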

The main consideration is performance. You want to write the algorithm so it uses an efficient integer type internally. Provided Int is big enough for whatever calculation you're doing (the standard guarantees a signed 30-bit integer, but even on 32-bit platforms using GHC, it's a signed 32-bit integer), you can assume it will be a high-speed integer type on the platform, particularly in comparison to Integer (which has boxing and bignum calculation overhead that can't be optimized away). Note that the performance differences can be substantial. Simple counting algorithms will often be 5-10x faster using Ints compared to Integers.
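
You can check the actual range on your platform in GHCi (the values below are from a 64-bit GHC; only the 30-bit minimum is guaranteed by the Report):

ghci> (minBound, maxBound) :: (Int, Int)
(-9223372036854775808,9223372036854775807)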

You could give your function a different signature:

depth :: Expr -> Integer
depth :: (Num a) => Expr -> a

and still implement it under the hood using the efficient Int type, converting at the end. But making the conversion implicit strikes me as poor practice. Particularly if this is a library function, it seems more sensible to make it clear that Int is being used internally by making it part of the signature.
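
If you do want to offer a polymorphic variant as well, a pattern that keeps the Int usage visible is to export the conversion as a separate wrapper, along the lines of this sketch (genericDepth is a hypothetical name, mirroring the generic* convention in Data.List):

-- A polymorphic convenience wrapper; the conversion is explicit at the
-- definition site rather than hidden inside depth.
genericDepth :: Num a => Expr -> a
genericDepth = fromIntegral . depth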

With respect to your listed considerations:

First, the generic* functions in Data.List aren't modern. In particular, genericLength was available in GHC 0.29, released in July 1996. At some earlier point, length had simply been defined in terms of genericLength:

length :: [a] -> Int
length = genericLength

but in GHC 0.29, this definition was commented out with an #ifdef USE_REPORT_PRELUDE, and several hand-optimized variants of length were defined independently. The other generic* functions weren't in 0.29, but they were already around by GHC 4.02 (1998).

Most importantly, when the Prelude version of length was generalized from lists to Foldables, which is a fairly recent development (since GHC 7.10?), nobody cared enough to do anything with genericLength. I also don't think I've ever seen these functions used "in the wild" in any serious Haskell code. For the most part, you can think of them as deprecated.
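
For reference, the two signatures as they stand in base today:

genericLength :: Num i => [a] -> i          -- Data.List, still list-only
length        :: Foldable t => t a -> Int   -- Prelude, generalized to Foldable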

Second, the use of lazy/strict and ByteString/Text variants in libraries represents a somewhat different situation. In particular, an xml-conduit user will normally decide between lazy and strict variants, and between ByteString and Text, based on considerations about the data being processed and the construction of the algorithms, considerations that are far-reaching and pervade the entire type system of a given program. If the only way to use xml-conduit with a lazy Text type was to convert it piecemeal to strict ByteStrings, pass it to the library, and then pull it back out and convert it back to a lazy Text type, no one would accept that complexity. In contrast, a monomorphic Int-based definition of depth works fine, because all you need is fromIntegral . depth to adapt it to any numeric context.

Third, as noted above, Integer is only a "nicer" type from the standpoint of having arbitrary precision in situations where you don't care about performance. For things like depth and count in any practical setting, performance is likely to be more important than unlimited precision.

Fourth, I don't think either the runtime cost or failure-to-memoize should be serious considerations in choosing between polymorphic or non-polymorphic versions here. In most situations, GHC will generate a specialized version of the polymorphic function in a context where memoization is no problem.
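
If you do care, you can also nudge GHC explicitly. For a polymorphic function like the hypothetical genericDepth wrapper above, an INLINABLE or SPECIALIZE pragma ensures monomorphic copies are generated (a sketch using standard GHC pragmas):

{-# INLINABLE genericDepth #-}
-- or pin down the common instantiations explicitly:
{-# SPECIALIZE genericDepth :: Expr -> Int     #-}
{-# SPECIALIZE genericDepth :: Expr -> Integer #-}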

On this basis, I suspect if Data.List was designed today, it would still use Ints.

Answered Jan 26 '23 by K. A. Buhr


I agree with all the points in K. A. Buhr's great answer, but here are a couple more:

You should use a return type of Integer if you expect to support an expression tree that somehow doesn't fit in memory (which seems interesting, but unlikely). If I saw Expr -> Integer I would go looking in the code or docs to try to understand how or why the codomain might be so large.

Re. performance of Integer: as long as the value fits in a machine word, normal machine-word arithmetic is used. Simplifying, the type is basically:

data Integer = SmallInteger Int | LargeInteger ByteArray

K. A. Buhr mentions an unavoidable performance penalty: such a value cannot be unboxed (that is, it will always have a heap representation and will be read from and written to memory), and that sounds right to me.

In contrast, functions on Int (or Word) are often unboxed, so that in Core you will see types that look like Int# -> Int# -> Int#. You can think of an Int# as existing only in a machine register. This is what you want your numeric code to look like if you care about performance.
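
As a hedged illustration, a strict counting loop like the one below compiles, with -O, to a worker you can inspect with ghc -O -ddump-simpl; its arguments show up as Int# rather than Int (sumTo is just an example function, and the exact shape of the worker varies by GHC version):

{-# LANGUAGE BangPatterns #-}

sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)

-- The dumped Core contains a worker along the lines of
--   $wgo :: Int# -> Int# -> Int#
-- with boxed Int appearing only at the boundary.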

Re. polymorphic versions: designing libraries around concrete numeric inputs and polymorphic numeric outputs probably works okay in terms of convenient type inference. We already have this to a certain degree, in that numeric literals are overloaded. There are certainly times when literals (and likewise string literals with -XOverloadedStrings) need to be given type signatures, so I'd expect that if base were designed to be more polymorphic you would run into more occasions where this is required (but fewer uses of fromIntegral).
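
A small illustrative example of that trade-off, using genericLength from Data.List: with a polymorphic result the fromIntegral disappears, but the result type has to be pinned down by a signature or the surrounding context instead:

import Data.List (genericLength)

-- With today's monomorphic length, the conversion is explicit:
n1 :: Integer
n1 = fromIntegral (length "hello")

-- With a polymorphic result, no conversion is needed, but something
-- (here the top-level signature) must fix the result type:
n2 :: Integer
n2 = genericLength "hello"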

Another option you haven't mentioned is using Word to express that the depth is non-negative. This is more useful for inputs, but even then is often not worth it: Word will still overflow, and negative literals are valid (although GHC will issue a warning); to a certain extent it's just moving where the bug occurs.
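
For example, on a 64-bit platform (GHC accepts the negative literal, warning via -Woverflowed-literals):

ghci> (maxBound :: Word) + 1
0
ghci> (-1) :: Word
18446744073709551615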

Answered Jan 26 '23 by jberryman