Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Haskell: List v. Array, difference in performance

Another question from a Haskell n00b.

I'm comparing the efficiency of various methods used to solve Problem #14 on the Project Euler website. In particular, I'm hoping to better understand the factors driving the difference in evaluation time for four (slightly) different approaches to solving the problem.

(Descriptions of problem #14 and the various approaches are below.)

First, a quick overview of Problem #14. It has to do with "Collatz numbers" (i.e., same programming exercise as my previous post which explored a different aspect of Haskell). A Collatz number for a given integer is equal to the length of the Collatz sequence for that integer. A Collatz sequence for an integer is calculated as follows: the first number ("n0") in the sequence is that integer itself; if n0 is even, the next number in the sequence ("n1") is equal to n / 2; if n0 is odd, then n1 is equal to 3 * n0 + 1. We continue recursively extending the sequence until we arrive at 1, at which point the sequence is finished. For example, the collatz sequence for 5 is: {5, 16, 8, 4, 2, 1} (because 16 = 3 * 5 + 1, 8 = 16 / 2, 4 = 8 / 2,...).

Problem 14 asks us to find the integer below 1,000,000 which has the largest Collatz number. To that effect, we can consider a function "collatz" which, when passed an integer "n" as an argument, returns the integer below n with the largest Collatz number. In other words, p 1000000 gives us the answer to Problem #14.

For the purposes of this exercise (i.e., understanding differences in evaluation time), we can consider Haskell versions of 'collatz' which vary across two dimensions:

(1) Implementation: Do we store the dataset of Collatz numbers (which will be generated for all integers 1..n) as a list or an array? I call this the "implementation" dimension, i.e., a function's implementation is either "list" or "array".

(2) Algorithm: do we calculate the Collatz number for any given integer n by extending out the Collatz sequence until it is complete (i.e., until we reach 1)? Or do we only extend out the sequence until we reach a number k which is smaller than n (at which point we can simply use the Collatz number of k, which we've already calculated)? I call this the "algorithm" dimension, i.e., a function's algorithm is either "complete" (calculation of Collatz number for each integer) or "partial". The latter obviously requires fewer operations.

Below are the four possible versions of the "collatz" function: array / partial, list / partial, array / complete and list / complete:

import Data.Array ( (!) , listArray , assocs )
import Data.Ord ( comparing )
import Data.List ( maximumBy )

--array implementation; partial algorithm (FEWEST OPERATIONS)
collatzAP x = maximumBy (comparing snd) $ assocs a where
    a = listArray (0,x) (0:1:[c n n | n <- [2..x]])
    c n i = let z = if even i then div i 2 else 3*i+1
      in if i < n then a ! i else 1 + c n z

--list implementation; partial algorithm
collatzLP x = maximum a where   
    a = zip (0:1:[c n n | n <- [2..x]]) [0..x]
    c n i = let z = if even i then div i 2 else 3*i+1
      in if i < n then fst (a!!i) else 1 + c n z

--array implementation, complete algorithm
collatzAC x = maximumBy (comparing snd) $ assocs a where
    a = listArray (0,x) (0:1:[c n n | n <- [2..x]])
    c n i = let z = if even i then div i 2 else 3*i+1
     in if i == 1 then 1 else 1 + c n z     

--list implementation, complete algorithm (MOST OPERATIONS)
collatzLC x = maximum a where   
    a = zip (0:1:[c n n | n <- [2..x]]) [0..x]
    c n i = let z = if even i then div i 2 else 3*i+1
      in if i == 1 then 1 else 1 + c n z

Regarding speed of evaluation: I know that arrays are far faster to access than lists (i.e., O(1) vs. O(n) access time for a given index n) so I expected the 'array' implementation of "collatz" to be faster than the 'list' implementation, ceteris paribus. Also, I expected the 'partial' algorithm to be faster than the 'complete' algorithm (ceteris paribus), given it needs to perform fewer operations in order to construct the dataset of Collatz numbers.

Testing our four functions across inputs of varying size, we observe the following evaluation times (comments below):

enter image description here

It's indeed the case that the 'array/partial' version is the fastest version of "collatz" (by a good margin). However, I find it a bit counter-intuitive that 'list/complete' isn't the slowest version. That honor goes to 'list/partial', which is more than 20x slower than 'list/complete'!

My question: Is the difference in evaluation time between 'list/partial' and 'list/complete' (as compared to that between 'array/partial' and 'array/complete') entirely due to the difference in access efficiency between lists and arrays in Haskell? Or am I not performing a "controlled experiment" (i.e., are there other factors at play)?

like image 237
iceman Avatar asked Sep 29 '22 17:09

iceman


1 Answers

I do not understand how the question about relative performance of two algorithms that work with lists are related to arrays at all...but here is my take:

Try to avoid indexing lists, especially long lists, if performance is of any concern. Indexing is really a traversal (as you know). "List/partial" is indexing/traversing a lot. List/complete is not. Hence the difference between Array/complete and List/complete is negligible, and the different between "list/partial" and the rest is huge.

like image 55
ArunasR Avatar answered Oct 06 '22 18:10

ArunasR