Performance of iterating over Array vs List

Tags: performance, f#

Inspired by this question, I wanted to see if there were any performance differences between iterating over an Array vs a List.

Since we would be iterating over the entire collection, my initial thought was that there shouldn't really be a performance difference between the two. Furthermore, I thought that using a tail-recursive function to do a count should be as fast as just using a mutable variable. However, when I wrote a simple script to test the difference, I found the following (run in Release mode with VS2015):

add_k_list, elapsed 15804 ms, result 0L
add_k_list_mutable, elapsed 12800 ms, result 0L
add_k_array, elapsed 15719 ms, result 0L

I'm wondering why the list implementation that uses a mutable variable is noticeably faster than both the tail-recursive version and the array version that also uses a mutable variable.

Here's my code:

open System.Diagnostics

let d = 100000
let n = 100000

let stopWatch = 
  let sw = Stopwatch ()
  sw.Start ()
  sw

let testList = [1..d]
let testArray = [|1..d|]

let timeIt (name : string) (a : int -> int list -> 'T) : unit = 
  let t = stopWatch.ElapsedMilliseconds
  let v = a 0 (testList)
  for i = 1 to (n) do
    a i testList |> ignore
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, result %A" name d v

let timeItArr (name : string) (a : int -> int [] -> 'T) : unit = 
  let t = stopWatch.ElapsedMilliseconds
  let v = a 0 (testArray)
  for i = 1 to (n) do
    a i testArray |> ignore
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, result %A" name d v

let add_k_list x (k_range: int list) =
    let rec add k_range x acc =  
        match k_range with
        | [] -> acc
        | k::ks -> let y = x ^^^ k
                   if (y < k || y > d) then
                       add ks x (acc + 1L)
                   else
                       add ks x acc
    add k_range x 0L


let add_k_list_mutable x (k_range: int list) = 
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > d) then
            count <- count + 1L
    count

let add_k_array x (k_range: int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > d) then
            count <- count + 1L
    count

[<EntryPoint>]
let main argv = 
    let x = 5
    timeItArr "add_k_array" add_k_array
    timeIt "add_k_list" add_k_list
    timeIt "add_k_list_mutable" add_k_list_mutable
    printfn "%A" argv
    0 // return an integer exit code

EDIT: The above test was run in 32-bit Release mode in VS2015. At the suggestion of s952163, I ran it in 64-bit mode and found the results differ quite a bit:

add_k_list, elapsed 17918 ms, result 0L
add_k_list_mutable, elapsed 17898 ms, result 0L
add_k_array, elapsed 8261 ms, result 0L

I'm especially surprised that the difference between using tail recursion with an accumulator vs a mutable variable seems to have disappeared.

asked Jun 15 '16 by Ringil



1 Answer

When running a slightly modified program (posted below), these are the numbers I received:

x64 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1296 ms, accumulated result 495000099L
add_k_list, elapsed 2675 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 2678 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 869 ms, accumulated result 499624318L
add_k_list, elapsed 2486 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 2483 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 750 ms, accumulated result 507000943L
add_k_list, elapsed 1602 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1603 ms, accumulated result 507000943L

x86 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1601 ms, accumulated result 495000099L
add_k_list, elapsed 2014 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 1835 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 1495 ms, accumulated result 499624318L
add_k_list, elapsed 1714 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 1595 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 1363 ms, accumulated result 507000943L
add_k_list, elapsed 1406 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1221 ms, accumulated result 507000943L

(As usual, it's important not to run with the debugger attached, as that changes how the JIT compiler works. With a debugger attached, the JIT produces code that is friendlier to the debugger but also slower.)

The benchmark keeps the total number of iterations constant but varies the count of the outer loop and the size of the list/array.

For me the only odd measurement is that the array loop is sometimes slower than the list loop.

If the total amount of work is the same why do we see different results when outer/inner is varied?

The answer is most likely related to the CPU cache. When we iterate over an array of size 10,000,000, its actual size in memory is 40,000,000 bytes. My machine has "just" 6,000,000 bytes of L3 cache. When the array size is 1,000,000, the size of the array is 4,000,000 bytes, which can fit in L3.

The list type in F# is essentially a singly-linked list, and a rough estimate of the size of a list element is 4 (data) + 8 (64-bit pointer) + 8 (vtable pointer) + 4 (heap overhead) = 24 bytes. With this estimate the size of a list with 10,000,000 elements is 240,000,000 bytes, and with 1,000,000 elements it is 24,000,000 bytes. Neither fits in the L3 cache on my machine.

When the number of elements is 100,000, the size of the array is 400,000 bytes and the list size is 2,400,000 bytes. Both fit snugly into the L3 cache.
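
As a back-of-the-envelope sketch of that arithmetic (the 24-byte cons cell and the 6,000,000-byte L3 figure are the rough assumptions from above, not measurements):

let bytesPerArrayElement = 4        // a plain int
let bytesPerConsCell     = 24       // 4 (data) + 8 (tail ptr) + 8 (vtable ptr) + 4 (heap overhead)
let l3CacheBytes         = 6000000  // rough L3 size of the machine used above

let footprint elements =
  let arrayBytes = elements * bytesPerArrayElement
  let listBytes  = elements * bytesPerConsCell
  printfn "%10d elements: array %11d B (fits L3: %b), list %11d B (fits L3: %b)"
    elements arrayBytes (arrayBytes <= l3CacheBytes) listBytes (listBytes <= l3CacheBytes)

[10000000; 1000000; 100000] |> List.iter footprint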

This reasoning can explain the difference in performance between smaller array/lists compared to bigger ones.

If the elements of the list are not allocated sequentially (i.e. the heap is fragmented or the GC moved them around), the performance of the list is expected to be much worse when it doesn't fit into the cache, because the CPU prefetch strategy no longer works. The elements of an array are guaranteed to be sequential, so prefetching works fine as long as you iterate sequentially.

Why is tail-recursion slower than the mutable for loop?

This actually isn't true in F# 3, where the for loop is expected to be much slower than tail recursion.

For a hint of the answer I used ILSpy to look at the generated IL code.

I found that FSharpList<>::get_TailOrNull() is called twice per loop iteration when using tail recursion: once to check whether we have reached the end, and once to get the next element (a redundant call).

The for loop version only calls FSharpList<>::get_TailOrNull() once per iteration.
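
As a rough illustration (this is hand-written F#, not the compiler output), the two traversal shapes look something like this: the pattern match conceptually reads the tail once to test for the end and again to bind the rest of the list, while an explicit loop over the same list touches the tail exactly once per step.

// Pattern-matching traversal: per the IL observation above, the tail ends up
// being fetched twice per iteration (end-of-list test + binding ks).
let countByMatch limit x (k_range : int list) =
    let rec add k_range acc =
        match k_range with
        | [] -> acc
        | k :: ks ->
            let y = x ^^^ k
            if y < k || y > limit then add ks (acc + 1L) else add ks acc
    add k_range 0L

// Explicit loop over the same list: the tail is read exactly once per step.
let countByLoop limit x (k_range : int list) =
    let mutable cur = k_range
    let mutable count = 0L
    while not cur.IsEmpty do
        let k = cur.Head
        let y = x ^^^ k
        if y < k || y > limit then count <- count + 1L
        cur <- cur.Tail
    count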

This extra call likely explains why tail recursion is slower, but as you noted, in x64 mode both list versions were about as fast. What's going on?

I checked the JIT-compiled assembly code and noted that the x64 JIT eliminated the extra call to FSharpList<>::get_TailOrNull(). The x86 JIT failed to eliminate the call.

Lastly, why is the array version slower than the list version on x86?

In general I expect arrays to have the least overhead of all collections in .NET. The reason is that an array is compact and sequential, and there are special instructions in ILAsm to access its elements.

So it's surprising to me that lists perform better in some cases.

Checking the assembly code again, the difference seems to come from the fact that the array version requires an extra variable to do its work, and the x86 CPU has few registers available, leading to an extra read from the stack per iteration. x64 has significantly more registers, so the array version only has to read from memory once per iteration, whereas the list version reads twice (head and tail).

Any conclusions?

  • When it comes to CPU performance, x64 is the way to go (this hasn't always been the case)
  • Expect arrays to perform better than any other data structure in .NET for operations that are O(1) on arrays (inserts are obviously slow)
  • The devil is in the details, meaning that in order to gain true insight we might need to check the assembly code.
  • Cache locality is very important for large collections. Since arrays are compact and guaranteed to be sequential, they are often a good choice.
  • It's very difficult to predict performance; always measure
  • Iterate towards zero when possible if you are really hungry for performance; this can save one read from memory (see the sketch after this list)
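
Here is a hypothetical downward-counting variant of add_k_array (my sketch, not part of the original benchmark). Counting down lets the loop test compare against the constant zero instead of re-reading an upper bound; whether it actually helps depends on what the JIT produces, so measure:

let add_k_array_down x l (k_range : int []) =
    let mutable count = 0L
    for i = k_range.Length - 1 downto 0 do
        let k = k_range.[i]       // indexed access instead of for-in enumeration
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count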

EDIT: The OP wondered why x86 lists seemed to perform better than x64 lists.

I reran the perf tests with the list/array size set to 1,000. This makes sure the entire data structure fits into my L1 cache (256 kB).

x64 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1062 ms, accumulated result 999499999L
add_k_list, elapsed 1134 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1110 ms, accumulated result 999499999L

x86 Release .NET 4.6.1

TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1617 ms, accumulated result 999499999L
add_k_list, elapsed 1359 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1100 ms, accumulated result 999499999L

We see that for this size x64 seems to perform about as well as, or better than, x86. Why do we see the opposite in the other measurements? I speculate that this is because the list elements are larger in the x64 version, meaning we use more bandwidth moving data from L3 to L1.

So, if you can, try to make sure your data fits into the L1 cache.

Final musings

When working with these sorts of questions I sometimes wonder if the whole von Neumann architecture is a big mistake. Instead we should have a data-flow architecture, since data is slow and instructions are fast.

AFAIK, under the hood CPUs have a data-flow architecture. The assembly language, though, looks like what one would expect from a von Neumann architecture, so in some sense it's a high-level abstraction over the data-flow architecture. But in order to produce reasonably performant code the CPU die is mostly occupied by cache (~95%). With a pure data-flow architecture one would expect a higher percentage of the CPU die to be doing actual work.

Hope this was interesting, my modified program follows:

open System.Diagnostics

let stopWatch =
  let sw = Stopwatch ()
  sw.Start ()
  sw

let timeIt (name : string) (outer : int) (a : int -> int64) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let mutable acc = a 0
  for i = 2 to outer do
    acc <- acc + a i
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, elapsed %d ms, accumulated result %A" name d acc

let add_k_list x l (k_range: int list) =
    let rec add k_range x acc =
        match k_range with
        | [] -> acc
        | k::ks -> let y = x ^^^ k
                   if (y < k || y > l) then
                       add ks x (acc + 1L)
                   else
                       add ks x acc
    add k_range x 0L


let add_k_list_mutable x l (k_range: int list) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count

let add_k_array x l (k_range: int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count
[<EntryPoint>]
let main argv =
  let total = 1000000000
  let outers = [|100; 1000; 10000|]

  for outer in outers do
    let inner = total / outer
    printfn "TestRun: Total: %d, Outer: %d, Inner: %d" total outer inner

    ignore <| System.GC.WaitForFullGCComplete ()

    let testList  = [1..inner]
    let testArray = [|1..inner|]

    timeIt    "add_k_array"         outer <| fun x -> add_k_array         x inner testArray
    timeIt    "add_k_list"          outer <| fun x -> add_k_list          x inner testList
    timeIt    "add_k_list_mutable"  outer <| fun x -> add_k_list_mutable  x inner testList

  0
answered Oct 01 '22 by Just another metaprogrammer