There are three functions given below which find the second-to-last element of a list. The one using last . init
seems much faster than the rest, and I can't figure out why.
For testing, I used an input list of [1..100000000]
(100 million elements). The last one runs almost instantly, whereas the others take several seconds.
-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"
-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse
-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init
When studying speed and optimization, it is very easy to obtain wildly wrong results. In particular, you cannot really say that one variant is faster than another without mentioning the compiler version and the optimization mode of your benchmarking setup. Even then, modern processors are so sophisticated as to feature neural-network-based branch predictors, not to mention all kinds of caches; so, even with a careful set-up, benchmarking results will be blurry.
That being said...
criterion is a package that provides advanced benchmarking tools. I quickly drafted a benchmark like this:
module Main where
import Criterion
import Criterion.Main
-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"
-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse
-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init
butLast2 :: [a] -> a
butLast2 (x : _ : []) = x
butLast2 (_ : xs@(_ : _)) = butLast2 xs
butLast2 _ = error "List too short"
setupEnv :: IO [Int]
setupEnv = do
  let xs = [1 .. 10^7] :: [Int]
  return xs
benches xs =
  [ bench "slow?" $ nf myButLast xs
  , bench "decent?" $ nf myButLast' xs
  , bench "fast?" $ nf myButLast'' xs
  , bench "match2" $ nf butLast2 xs
  ]
main :: IO ()
main = defaultMain
  [ env setupEnv $ \xs -> bgroup "main" $ let bs = benches xs in bs ++ reverse bs ]
As you see, I added a variant that explicitly matches on two elements at once, but otherwise it is the same code verbatim. I also run the benchmarks in reverse order, so as to expose any bias due to caching. So, let us run and see!
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.5
% ghc -O2 -package criterion A.hs && ./A
benchmarking main/slow?
time 54.83 ms (54.75 ms .. 54.90 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 54.86 ms (54.82 ms .. 54.93 ms)
std dev 94.77 μs (54.95 μs .. 146.6 μs)
benchmarking main/decent?
time 794.3 ms (32.56 ms .. 1.293 s)
0.907 R² (0.689 R² .. 1.000 R²)
mean 617.2 ms (422.7 ms .. 744.8 ms)
std dev 201.3 ms (105.5 ms .. 283.3 ms)
variance introduced by outliers: 73% (severely inflated)
benchmarking main/fast?
time 84.60 ms (84.37 ms .. 84.95 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 84.46 ms (84.25 ms .. 84.77 ms)
std dev 435.1 μs (239.0 μs .. 681.4 μs)
benchmarking main/match2
time 54.87 ms (54.81 ms .. 54.95 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 54.85 ms (54.81 ms .. 54.92 ms)
std dev 104.9 μs (57.03 μs .. 178.7 μs)
benchmarking main/match2
time 50.60 ms (47.17 ms .. 53.01 ms)
0.993 R² (0.981 R² .. 0.999 R²)
mean 60.74 ms (56.57 ms .. 67.03 ms)
std dev 9.362 ms (6.074 ms .. 10.95 ms)
variance introduced by outliers: 56% (severely inflated)
benchmarking main/fast?
time 69.38 ms (56.64 ms .. 78.73 ms)
0.948 R² (0.835 R² .. 0.994 R²)
mean 108.2 ms (92.40 ms .. 129.5 ms)
std dev 30.75 ms (19.08 ms .. 37.64 ms)
variance introduced by outliers: 76% (severely inflated)
benchmarking main/decent?
time 770.8 ms (345.9 ms .. 1.004 s)
0.967 R² (0.894 R² .. 1.000 R²)
mean 593.4 ms (422.8 ms .. 691.4 ms)
std dev 167.0 ms (50.32 ms .. 226.1 ms)
variance introduced by outliers: 72% (severely inflated)
benchmarking main/slow?
time 54.87 ms (54.77 ms .. 55.00 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 54.95 ms (54.88 ms .. 55.10 ms)
std dev 185.3 μs (54.54 μs .. 251.8 μs)
Looks like our "slow" version is not slow at all! And the intricacies of pattern matching do not add anything. (A slight speed-up between two consecutive runs of match2 I ascribe to the effects of caching.)
There is a way to get more "scientific" data: we can -ddump-simpl and take a look at the way the compiler sees our code.
"Core" is an internal language of GHC. Every Haskell source file is simplified to Core before
being transformed into the final functional graph for the run time system to execute. If we look
at this intermediate stage, it will tell us that myButLast
and butLast2
are equivalent. It
does take looking, since, at the renaming stage, all our nice identifiers are randomly mangled.
% for i in `seq 1 4`; do echo; cat A$i.hs; ghc -O2 -ddump-simpl A$i.hs > A$i.simpl; done
module A1 where
-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"
module A2 where
-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse
module A3 where
-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init
module A4 where
butLast2 :: [a] -> a
butLast2 (x : _ : []) = x
butLast2 (_ : xs@(_ : _)) = butLast2 xs
butLast2 _ = error "List too short"
% ./EditDistance.hs *.simpl
(("A1.simpl","A2.simpl"),3866)
(("A1.simpl","A3.simpl"),3794)
(("A2.simpl","A3.simpl"),663)
(("A1.simpl","A4.simpl"),607)
(("A2.simpl","A4.simpl"),4188)
(("A3.simpl","A4.simpl"),4113)
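The EditDistance.hs script itself is not shown above. A minimal sketch of what it might look like (hypothetical reconstruction: a naive dynamic-programming Levenshtein distance, computed for every pair of files named on the command line):

```haskell
-- EditDistance.hs — hypothetical sketch: print the Levenshtein edit
-- distance between every pair of files given as arguments.
import Data.Array
import System.Environment (getArgs)

-- Classic dynamic-programming Levenshtein distance, memoized in a
-- lazily constructed array.
levenshtein :: String -> String -> Int
levenshtein xs ys = table ! (m, n)
  where
    m = length xs
    n = length ys
    ax = listArray (1, m) xs :: Array Int Char
    ay = listArray (1, n) ys :: Array Int Char
    table :: Array (Int, Int) Int
    table = listArray ((0, 0), (m, n))
              [ dist i j | i <- [0 .. m], j <- [0 .. n] ]
    dist i 0 = i                       -- delete all of the prefix
    dist 0 j = j                       -- insert all of the prefix
    dist i j = minimum
      [ table ! (i - 1, j) + 1         -- deletion
      , table ! (i, j - 1) + 1         -- insertion
      , table ! (i - 1, j - 1)         -- substitution (free on match)
          + (if ax ! i == ay ! j then 0 else 1)
      ]

-- All unordered pairs of a list.
pairs :: [a] -> [(a, a)]
pairs [] = []
pairs (x : xs) = [ (x, y) | y <- xs ] ++ pairs xs

main :: IO ()
main = do
  files <- getArgs
  contents <- mapM readFile files
  mapM_ print [ ((f, g), levenshtein a b)
              | ((f, a), (g, b)) <- pairs (zip files contents) ]
```

Any such crude textual distance is enough here: we only care about the relative similarity of the Core dumps, not the exact numbers.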
It seems that A1 and A4 are the most similar. Thorough inspection will show that the code structures in A1 and A4 are indeed identical. That A2 and A3 are alike is also reasonable, since both are defined as a composition of two functions.
If you are going to examine the Core output extensively, it also makes sense to supply flags such as -dsuppress-module-prefixes and -dsuppress-uniques. They make it so much easier to read.
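For example, the earlier dump command with the suppression flags added (same A4.hs module as above):

```shell
# Dump simplified Core with readable names: suppress module prefixes
# and the unique suffixes that mangle identifiers.
ghc -O2 -ddump-simpl -dsuppress-module-prefixes -dsuppress-uniques A4.hs > A4.simpl
```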
So, what can go wrong with benchmarking and optimization?
ghci, being designed for interactive play and quick iteration, compiles Haskell source to a certain flavour of byte code rather than a final executable, and eschews expensive optimizations in favour of quicker reloads. So benchmarking under ghci tells you little about the performance of the compiled program.

This may look sad, but it is really not the thing that should concern a Haskell programmer most of the time. Real story: I have a friend who just recently started learning Haskell. They had written a program for numerical integration, and it was turtle slow. So we sat down together and wrote a categorical description of the algorithm, with diagrams and stuff. When they re-wrote the code to align with the abstract description, it magically became, like, cheetah fast, and slim on memory too. We calculated π in no time. Moral of the story? Perfect abstract structure, and your code will optimize itself.