Inspired by this question, I wanted to see if there were any performance differences between iterating over an Array vs a List.
Since we would be iterating over the entire collection, my initial thought was that there shouldn't be much of a performance difference between the two. Furthermore, I thought that using a tail-recursive function to do a count should be as fast as just using a mutable variable. However, when I wrote a simple script to test the difference, I found the following (run in Release mode with VS2015):
add_k_list, elapsed 15804 ms, result 0L
add_k_list_mutable, elapsed 12800 ms, result 0L
add_k_array, elapsed 15719 ms, result 0L
I wonder why the list implementation that uses a mutable variable is noticeably faster than both the tail-recursive version and the one that uses a mutable variable with an array.
Here's my code:
open System.Diagnostics

let d = 100000
let n = 100000

let stopWatch =
    let sw = Stopwatch ()
    sw.Start ()
    sw

let testList = [1..d]
let testArray = [|1..d|]

let timeIt (name : string) (a : int -> int list -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a 0 testList
    for i = 1 to n do
        a i testList |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, elapsed %d ms, result %A" name d v

let timeItArr (name : string) (a : int -> int [] -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a 0 testArray
    for i = 1 to n do
        a i testArray |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, elapsed %d ms, result %A" name d v

let add_k_list x (k_range : int list) =
    let rec add k_range x acc =
        match k_range with
        | [] -> acc
        | k :: ks ->
            let y = x ^^^ k
            if (y < k || y > d) then
                add ks x (acc + 1L)
            else
                add ks x acc
    add k_range x 0L

let add_k_list_mutable x (k_range : int list) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > d) then
            count <- count + 1L
    count

let add_k_array x (k_range : int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > d) then
            count <- count + 1L
    count

[<EntryPoint>]
let main argv =
    let x = 5
    timeItArr "add_k_array" add_k_array
    timeIt "add_k_list" add_k_list
    timeIt "add_k_list_mutable" add_k_list_mutable
    printfn "%A" argv
    0 // return an integer exit code
EDIT: The above test was run in 32-bit Release mode in VS2015. At the suggestion of s952163, I reran it as 64-bit and found the results differ quite a bit:
add_k_list, elapsed 17918 ms, result 0L
add_k_list_mutable, elapsed 17898 ms, result 0L
add_k_array, elapsed 8261 ms, result 0L
I'm especially surprised that the difference between using tail recursion with an accumulator vs a mutable variable seems to have disappeared.
When running a slightly modified program (posted below) these are the numbers I received:
TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1296 ms, accumulated result 495000099L
add_k_list, elapsed 2675 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 2678 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 869 ms, accumulated result 499624318L
add_k_list, elapsed 2486 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 2483 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 750 ms, accumulated result 507000943L
add_k_list, elapsed 1602 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1603 ms, accumulated result 507000943L
TestRun: Total: 1000000000, Outer: 100, Inner: 10000000
add_k_array, elapsed 1601 ms, accumulated result 495000099L
add_k_list, elapsed 2014 ms, accumulated result 495000099L
add_k_list_mutable, elapsed 1835 ms, accumulated result 495000099L
TestRun: Total: 1000000000, Outer: 1000, Inner: 1000000
add_k_array, elapsed 1495 ms, accumulated result 499624318L
add_k_list, elapsed 1714 ms, accumulated result 499624318L
add_k_list_mutable, elapsed 1595 ms, accumulated result 499624318L
TestRun: Total: 1000000000, Outer: 10000, Inner: 100000
add_k_array, elapsed 1363 ms, accumulated result 507000943L
add_k_list, elapsed 1406 ms, accumulated result 507000943L
add_k_list_mutable, elapsed 1221 ms, accumulated result 507000943L
(As usual, it's important not to run with the debugger attached, as that changes how the JIT compiler works. With a debugger attached the JIT compiler produces code that is easier to debug but also slower.)
The way this works is that the total number of iterations is kept constant while the count of the outer loop and the size of the list/array vary.
For me the only measurement that is odd is that the array loop is worse in some cases than the list loop.
The answer is most likely related to the CPU cache. When we iterate over an array of size 10,000,000 its actual size in memory is 40,000,000 bytes. My machine has "just" 6,000,000 bytes of L3 cache. When the array size is 1,000,000 its size is 4,000,000 bytes, which fits in L3.
The list type in F# is essentially a singly-linked list, and a rough estimate of the size of a list element is 4 (data) + 8 (64-bit pointer) + 8 (vtable pointer) + 4 (heap overhead) = 24 bytes. With this estimate, a list with 10,000,000 elements takes 240,000,000 bytes and one with 1,000,000 elements takes 24,000,000 bytes. Neither fits in the L3 cache on my machine.
When the number of elements is 100,000 the array size is 400,000 bytes and the list size is 2,400,000 bytes. Both fit snugly into the L3 cache.
This reasoning can explain the difference in performance between smaller array/lists compared to bigger ones.
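The size arithmetic above can be sketched directly. `arrayBytes` and `listBytes` are hypothetical helpers introduced here for illustration, and the 24 bytes per list element is the rough estimate from above, not a guaranteed figure:

```fsharp
// Back-of-envelope estimates: ints in an array are stored contiguously
// (4 bytes each), while each list element is its own heap cell
// (~24 bytes by the estimate above).
let arrayBytes (n : int) = 4L * int64 n
let listBytes  (n : int) = 24L * int64 n

for n in [ 100_000; 1_000_000; 10_000_000 ] do
    printfn "n = %8d: array %11d B, list %11d B" n (arrayBytes n) (listBytes n)
```

Comparing these numbers against a ~6,000,000-byte L3 cache shows why only the smallest size fits for both collections.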
If the elements of the list are not allocated sequentially (i.e. the heap is fragmented or the GC moved them around), the performance of the list is expected to be much worse when it doesn't fit into the cache, because the CPU prefetch strategy no longer works. The elements of an array are guaranteed to be sequential, so prefetching works fine if you iterate sequentially.
This actually isn't true in F# 3, where the for loop is expected to be much slower than the tail recursion.
For a hint of the answer I used ILSpy to look at the generated IL code.
I found that FSharpList<>::get_TailOrNull() is called twice per loop iteration when using tail recursion: once to check if we reached the end, and once to get the next element (a redundant call). The for-loop version only calls FSharpList<>::get_TailOrNull() once.
This extra call likely explains why the tail recursion is slower, but as you noted, in x64 mode both list versions were about as fast. What's going on?
I checked the JITted assembly code and noted that the x64 JIT compiler eliminated the extra call to FSharpList<>::get_TailOrNull(); the x86 JIT compiler failed to eliminate it.
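To make the two loop shapes concrete, here is a simplified sketch of the traversal patterns, with plain pattern matching standing in for the internal TailOrNull/HeadOrDefault members; this is illustrative, not actual compiler output:

```fsharp
// Tail-recursive shape: on x86 the match compiles down to two
// get_TailOrNull() calls per iteration (one for the end-of-list
// test, one to bind the tail).
let rec countRec acc xs =
    match xs with
    | [] -> acc
    | _ :: ks -> countRec (acc + 1L) ks

// for-loop shape: the enumerator advances once per iteration,
// i.e. a single get_TailOrNull() call.
let countFor (xs : int list) =
    let mutable c = 0L
    for _ in xs do
        c <- c + 1L
    c
```

Both compute the same result; the difference is only in how many times the tail pointer is fetched per iteration in the emitted IL.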
In general I expect arrays to have the least overhead of all collections in .NET. The reason is that an array is compact and sequential, and there are special instructions in IL to access its elements.
So it's surprising to me that lists perform better in some cases.
Checking the assembly code again, it seems the array version requires an extra variable to perform its work, and the x86 CPU has fewer registers available, leading to an extra read from the stack per iteration. x64 has significantly more registers, so the array version only has to read once from memory per iteration, whereas the list version reads twice (head and tail).
EDIT: OP wondered why x86 lists seemed to perform better than x64 lists.
I reran the perf tests with the list/array size set to 1,000. This makes sure the entire data structure fits into my L1 cache (256 kB).
TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1062 ms, accumulated result 999499999L
add_k_list, elapsed 1134 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1110 ms, accumulated result 999499999L
TestRun: Total: 1000000000, Outer: 1000000, Inner: 1000
add_k_array, elapsed 1617 ms, accumulated result 999499999L
add_k_list, elapsed 1359 ms, accumulated result 999499999L
add_k_list_mutable, elapsed 1100 ms, accumulated result 999499999L
We see that for this size x64 performs about as well as or better than x86. Why do we see the opposite in the other measurements? I speculate that this is because the list elements are larger in the x64 version, meaning we use more bandwidth moving data from L3 to L1.
So, if you can, try to make sure your data fits into the L1 cache.
When working with these sorts of questions I sometimes wonder if the whole von Neumann architecture is a big mistake. Instead we should have a data-flow architecture, as data is slow and instructions are fast.
AFAIK, under the hood CPUs do have a data-flow architecture. The assembly language, though, looks like what one would expect from a von Neumann architecture, so in some sense it's a high-level abstraction over the data-flow architecture. But in order to produce reasonably performant code, the CPU die is mostly occupied by cache (~95%). With a pure data-flow architecture one would expect a higher percentage of the CPU die to do actual work.
Hope this was interesting, my modified program follows:
open System.Diagnostics

let stopWatch =
    let sw = Stopwatch ()
    sw.Start ()
    sw

let timeIt (name : string) (outer : int) (a : int -> int64) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let mutable acc = a 0
    for i = 2 to outer do
        acc <- acc + a i
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, elapsed %d ms, accumulated result %A" name d acc

let add_k_list x l (k_range : int list) =
    let rec add k_range x acc =
        match k_range with
        | [] -> acc
        | k :: ks ->
            let y = x ^^^ k
            if (y < k || y > l) then
                add ks x (acc + 1L)
            else
                add ks x acc
    add k_range x 0L

let add_k_list_mutable x l (k_range : int list) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count

let add_k_array x l (k_range : int []) =
    let mutable count = 0L
    for k in k_range do
        let y = x ^^^ k
        if (y < k || y > l) then
            count <- count + 1L
    count

[<EntryPoint>]
let main argv =
    let total = 1000000000
    let outers = [|100; 1000; 10000|]
    for outer in outers do
        let inner = total / outer
        printfn "TestRun: Total: %d, Outer: %d, Inner: %d" total outer inner
        ignore <| System.GC.WaitForFullGCComplete ()
        let testList = [1..inner]
        let testArray = [|1..inner|]
        timeIt "add_k_array" outer <| fun x -> add_k_array x inner testArray
        timeIt "add_k_list" outer <| fun x -> add_k_list x inner testList
        timeIt "add_k_list_mutable" outer <| fun x -> add_k_list_mutable x inner testList
    0