When finding the second-to-last element of a list, why is using `last . init` the fastest among these?

Tags:

haskell

There are three functions given below, each of which finds the second-to-last element of a list. The one using last . init seems much faster than the rest, and I can't figure out why.

For testing, I used an input list of [1..100000000] (100 million). The last one runs almost instantly whereas the others take several seconds.

-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"

-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse

-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init
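
For concreteness, a minimal driver along these lines would reproduce the informal timing; it is hypothetical (the exact harness is not shown) and assumes the three definitions above live in the same module:

-- Hypothetical timing driver; the exact harness is not shown in the question.
-- Assumes myButLast, myButLast' and myButLast'' are defined in this module.
main :: IO ()
main = print (myButLast'' [1 .. 100000000 :: Int])

Compiled with ghc -O2 and timed with the shell's time, this is enough to see the difference between the variants.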
Asked Oct 24 '19 by storm125

1 Answer

When studying speed and optimization, it is very easy to obtain wildly wrong results. In particular, you cannot really say that one variant is faster than another without mentioning the compiler version and the optimization mode of your benchmarking setup. Even then, modern processors are so sophisticated as to feature neural network based branch predictors, not to mention all kinds of caches, so, even with careful set-up, benchmarking results will be blurry.

That being said...

Benchmarking is our friend.

criterion is a package that provides advanced benchmarking tools. I quickly drafted a benchmark like this:

module Main where

import Criterion
import Criterion.Main

-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"

-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse

-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init

butLast2 :: [a] -> a
butLast2 (x :     _ : [ ] ) = x
butLast2 (_ : xs@(_ : _ ) ) = butLast2 xs
butLast2 _ = error "List too short"

setupEnv = do
  let xs = [1 .. 10^7] :: [Int]
  return xs

benches xs =
  [ bench "slow?"   $ nf myButLast   xs
  , bench "decent?" $ nf myButLast'  xs
  , bench "fast?"   $ nf myButLast'' xs
  , bench "match2"  $ nf butLast2    xs
  ]

main = defaultMain
    [ env setupEnv $ \ xs -> bgroup "main" $ let bs = benches xs in bs ++ reverse bs ]

As you can see, I added a variant that explicitly matches on two elements at once; otherwise it is the same code verbatim. I also run the benchmarks in reverse order, so as to be aware of any bias due to caching. So, let us run and see!

% ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.5


% ghc -O2 -package criterion A.hs && ./A
benchmarking main/slow?
time                 54.83 ms   (54.75 ms .. 54.90 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 54.86 ms   (54.82 ms .. 54.93 ms)
std dev              94.77 μs   (54.95 μs .. 146.6 μs)

benchmarking main/decent?
time                 794.3 ms   (32.56 ms .. 1.293 s)
                     0.907 R²   (0.689 R² .. 1.000 R²)
mean                 617.2 ms   (422.7 ms .. 744.8 ms)
std dev              201.3 ms   (105.5 ms .. 283.3 ms)
variance introduced by outliers: 73% (severely inflated)

benchmarking main/fast?
time                 84.60 ms   (84.37 ms .. 84.95 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 84.46 ms   (84.25 ms .. 84.77 ms)
std dev              435.1 μs   (239.0 μs .. 681.4 μs)

benchmarking main/match2
time                 54.87 ms   (54.81 ms .. 54.95 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 54.85 ms   (54.81 ms .. 54.92 ms)
std dev              104.9 μs   (57.03 μs .. 178.7 μs)

benchmarking main/match2
time                 50.60 ms   (47.17 ms .. 53.01 ms)
                     0.993 R²   (0.981 R² .. 0.999 R²)
mean                 60.74 ms   (56.57 ms .. 67.03 ms)
std dev              9.362 ms   (6.074 ms .. 10.95 ms)
variance introduced by outliers: 56% (severely inflated)

benchmarking main/fast?
time                 69.38 ms   (56.64 ms .. 78.73 ms)
                     0.948 R²   (0.835 R² .. 0.994 R²)
mean                 108.2 ms   (92.40 ms .. 129.5 ms)
std dev              30.75 ms   (19.08 ms .. 37.64 ms)
variance introduced by outliers: 76% (severely inflated)

benchmarking main/decent?
time                 770.8 ms   (345.9 ms .. 1.004 s)
                     0.967 R²   (0.894 R² .. 1.000 R²)
mean                 593.4 ms   (422.8 ms .. 691.4 ms)
std dev              167.0 ms   (50.32 ms .. 226.1 ms)
variance introduced by outliers: 72% (severely inflated)

benchmarking main/slow?
time                 54.87 ms   (54.77 ms .. 55.00 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 54.95 ms   (54.88 ms .. 55.10 ms)
std dev              185.3 μs   (54.54 μs .. 251.8 μs)

Looks like our "slow" version is not slow at all! And the intricacies of pattern matching do not add anything. (The slight speed-up between the two consecutive runs of match2 I ascribe to the effects of caching.)

There is a way to get more "scientific" data: we can pass -ddump-simpl to GHC and take a look at the way the compiler sees our code.

Inspection of intermediate structures is our friend.

"Core" is an internal language of GHC. Every Haskell source file is simplified to Core before being transformed into the final functional graph for the run time system to execute. If we look at this intermediate stage, it will tell us that myButLast and butLast2 are equivalent. It does take looking, since, at the renaming stage, all our nice identifiers are randomly mangled.

% for i in `seq 1 4`; do echo; cat A$i.hs; ghc -O2 -ddump-simpl A$i.hs > A$i.simpl; done

module A1 where

-- slow
myButLast :: [a] -> a
myButLast [x, y] = x
myButLast (x : xs) = myButLast xs
myButLast _ = error "List too short"

module A2 where

-- decent
myButLast' :: [a] -> a
myButLast' = (!! 1) . reverse

module A3 where

-- fast
myButLast'' :: [a] -> a
myButLast'' = last . init

module A4 where

butLast2 :: [a] -> a
butLast2 (x :     _ : [ ] ) = x
butLast2 (_ : xs@(_ : _ ) ) = butLast2 xs
butLast2 _ = error "List too short"

% ./EditDistance.hs *.simpl
(("A1.simpl","A2.simpl"),3866)
(("A1.simpl","A3.simpl"),3794)
(("A2.simpl","A3.simpl"),663)
(("A1.simpl","A4.simpl"),607)
(("A2.simpl","A4.simpl"),4188)
(("A3.simpl","A4.simpl"),4113)

It seems that A1 and A4 are the most similar. Thorough inspection will show that indeed the code structures in A1 and A4 are identical. That A2 and A3 are alike is also reasonable since both are defined as a composition of two functions.
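
The EditDistance.hs script itself is not shown here. A minimal sketch of what such a tool might look like is below: it computes a plain Levenshtein distance between every pair of files named on the command line (hypothetical, not the original script):

#!/usr/bin/env runghc
-- Hypothetical sketch of an EditDistance.hs script; the original is not shown.
-- It prints a naive Levenshtein distance for every pair of input files,
-- which is enough to produce output of the shape seen above.
module Main where

import Data.List (foldl', tails)
import System.Environment (getArgs)

-- Two-row dynamic-programming Levenshtein distance, O(n * m).
levenshtein :: String -> String -> Int
levenshtein a b = last (foldl' step [0 .. length b] a)
  where
    step prev c = scanl go (head prev + 1) (zip3 prev (tail prev) b)
      where
        go left (diag, up, c')
          | c == c'   = diag
          | otherwise = 1 + minimum [left, diag, up]

main :: IO ()
main = do
  files <- getArgs
  texts <- mapM readFile files
  let pairs = [ ((f, g), levenshtein s t)
              | (f, s) : rest <- tails (zip files texts)
              , (g, t)        <- rest
              ]
  mapM_ print pairs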

If you are going to examine the Core output extensively, it makes sense to also supply flags such as -dsuppress-module-prefixes and -dsuppress-uniques; they make it so much easier to read.
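
For example, a cleaner dump of the A4 module could be produced like this (the flags are standard GHC debugging flags; the output file name is arbitrary):

% ghc -O2 -ddump-simpl -dsuppress-module-prefixes -dsuppress-uniques A4.hs > A4.core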

A short list of our enemies, too.

So, what can go wrong with benchmarking and optimization?

  • ghci, being designed for interactive play and quick iteration, compiles Haskell source to a flavour of byte code rather than to a final executable, and eschews expensive optimizations in favour of quicker reloads.
  • Profiling seems like a nice tool for looking into the performance of individual bits and pieces of a complex program, but it can wreck compiler optimizations so badly that the results will be orders of magnitude off base.
    • Your safeguard is to profile every small bit of code as a separate executable, with its own benchmark runner.
  • Garbage collection is tuneable. Just today a new major feature was released. Delays due to garbage collection will affect performance in ways that are not straightforward to predict; a way to at least observe them is shown in the example after this list.
  • As I mentioned, different compiler versions will build different code with different performance, so you have to know what version the user of your code will likely use to build it, and benchmark with that, before you make any promises.
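
As an illustration of the garbage collection point, one can compile a standalone, optimized executable and ask the run-time system for allocation and collection statistics. The flags below are standard GHC and RTS options; this command is a sketch, not part of the original measurements:

% ghc -O2 -rtsopts -package criterion A.hs && ./A +RTS -s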

This may look sad. But it is really not the thing that should concern a Haskell programmer, most of the time. Real story: I have a friend who just recently started learning Haskell. They had written a program for numerical integration, and it was turtle slow. So we sat down together and wrote a categorial description of the algorithm, with diagrams and stuff. When they re-wrote the code to align with the abstract description, it magically became, like, cheetah fast, and slim on memory too. We calculated π in no time. Moral of the story? Perfect abstract structure, and your code will optimize itself.

Answered by Ignat Insarov