I'm wondering to what extent do I have to worry about micro-optimizing common sub-expressions in Haskell (using GHC) especially with regard to data destructuring.
For example, consider this code for merging two ordered lists:
merge :: Ord a => [a] -> [a] -> [a]
merge [] bs = bs
merge as [] = as
merge (a:as) (b:bs) =
case compare a b of
LT -> a : merge as (b:bs) -- (1)
EQ -> a : merge as bs
GT -> b : merge (a:as) bs -- (2)
Will GHC recognize that the (a:as)
at (2) and (b:bs)
at (1) are the same as the input parameters? Or, should I use "as" patterns, e.g.:
merge a'@(a:as) b'@(b:bs) =
...
LT -> a : merge as b'
...
GT -> b : merge a' bs
And in general when can GHC recognize common sub-expressions in in data constructions and when do I need to help it?
Overview. We use pattern matching in Haskell to simplify our codes by identifying specific types of expression. We can also use if-else as an alternative to pattern matching. Pattern matching can also be seen as a kind of dynamic polymorphism where, based on the parameter list, different methods can be executed.
Haskell guards are used to test the properties of an expression; it might look like an if-else statement from a beginner's view, but they function very differently. Haskell guards can be simpler and easier to read than pattern matching .
It represents "computations that could fail to return a value". Just like with the fmap example, this lets you do a whole bunch of computations without having to explicitly check for errors after each step.
It depends. If you compile without optimizations, there is definitely a penalty to not using the as-patterns, but if you compile the following code
module Test where
merge1 :: Ord a => [a] -> [a] -> [a]
merge1 [] bs = bs
merge1 as [] = as
merge1 (a:as) (b:bs) = case compare a b of
LT -> a : merge1 as (b:bs)
EQ -> a : merge1 as bs
GT -> b : merge1 (a:as) bs
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 [] bs = bs
merge2 as [] = as
merge2 xs@(a:as) ys@(b:bs) = case compare a b of
LT -> a : merge2 as ys
EQ -> a : merge2 as bs
GT -> b : merge2 xs bs
with -O2
and -ddump-simpl
, and after a lot of judicious editing, renaming of variables, stripping of metadata and annotations, and various other tricks you can arrive at something like
merge2_2 :: Ord a => a -> [a] -> [a] -> [a]
merge2_2 = \ ordInst x y z -> case z of _ {
[] -> (x:y);
(w:v) -> case compare ordInst x w of _ {
LT -> x : merge2_1 ordInst y w v;
EQ -> x : merge2 ordInst y v;
GT -> w : merge2_2 ordInst x y v
}
}
merge2_1 :: Ord a => [a] -> a -> [a] -> [a]
merge2_1 = \ ordInst x y z -> case x of _ {
[] -> (y:z);
(w:v) -> case compare ordInst w y of _ {
LT -> w : merge2_1 ordInst v y z;
EQ -> w : merge2 ordInst v z;
GT -> y : merge2_2 ordInst w v z
}
}
merge2 :: Ord a => [a] -> [a] -> [a]
merge2 = \ ordInst x y -> case x of wild {
[] -> y;
(w:v) -> case y of _ {
[] -> wild;
(z:zs) -> case compare ordInst w z of _ {
LT -> w : merge2_1 ordInst v z zs;
EQ -> w : merge2 ordInst v zs;
GT -> z : merge2_2 ordInst w v zs
}
}
}
merge1_2 :: Ord a => a -> [a] -> [a] -> [a]
merge1_2 = \ ordInst x y z -> case z of _ {
[] -> (x:y);
(w:v) -> case compare ordInst x w of _ {
LT -> x : merge1_1 ordInst y w v;
EQ -> x : merge1 ordInst y v;
GT -> w : merge1_2 ordInst x y v
}
}
merge1_1 :: Ord a => [a] -> a -> [a] -> [a]
merge1_1 = \ ordInst x y z -> case x of _ {
[] -> (y:z);
(w:v) -> case compare ordInst w y of _ {
LT -> w : merge1_1 ordInst v y z;
EQ -> w : merge1 ordInst v z;
GT -> y : merge1_2 ordInst w v z
}
}
merge1 :: Ord a => [a] -> [a] -> [a]
merge1 = \ ordInst x y -> case x of wild {
[] -> y;
(w:v) -> case y of _ {
[] -> wild;
(z:zs) -> case compare ordInst w z of _ {
LT -> w : merge1_1 ordInst v z zs;
EQ -> w : merge1 ordInst v zs;
GT -> z : merge1_2 ordInst w v zs
}
}
}
Which, after comparison with a diffing tool, tells me that the only differences between these two definitions is the numbers at the ends of the merge
function names. So yes, after applying optimizations GHC can figure out these as-patterns for you, at least in the case of lists. However, I would still recommend using them because there are always edge cases, and if you do use them then you don't have to trust the compiler to do something complicated. It also means that even if optimizations aren't enabled, your code will still have that optimization in there. It may seem like a tiny change, but if this function ends up in an inner loop or the structures you're working on are very large it could have a significant performance impact. Additionally, I find that for constructors with a large number of fields it's often just more convenient to have a name to refer to "constructed" form.
Here's the full core dump if you're interested. Basically, I started by joining lines together to take up less room, then removed unnecessary noise in the type signatures, removed qualified module names, all of the [Something]
annotations, renamed things like merge2_$smerge2
to merge2_2
, removed all local type signatures, then started renaming using Sublime Text's wonderful multiple cursors feature. I started with the type names, renaming them all to just a
, then renamed variable names to x
, y
, z
, w
, v
, and zs
(I'm creative, aren't I?), made all applications of the :
operator infix, removed the Rec
regions, and a few more bits of styling later I got the output you see above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With