I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results. Version 1: <pre class="prettyprint"><code>func newMatrix(n int) [][]int { m := make([][]int, n) buf := make([]int, n*n) for i := range m { m[i] = buf[i*n : (i+1)*n] } return m } func mult1(m1, m2, res [][]int) [][]int { for i := range m1 { for k := range m1[0] { for j := range m2[0] { res[i][j] += m1[i][k] * m2[k][j] } } } return res } </code></pre> From the linear array i create multiple slices that represent the matrix rows. Version 2: <pre class="prettyprint"><code>func mult2(m1, m2, res []int, n int) []int { for i := 0; i < n; i++ { for k := 0; k < n; k++ { for j := 0; j < n; j++ { res[i*n+j] += m1[i*n+k] * m2[k*n+j] } } } return res } </code></pre> In this version I simply use a linear array and index into it from the multiplication. Multiplying 2 2048x2048 matrices give the following execution time: <pre class="prettyprint"><code> version 1: 35.550813801s version 2: 19.090223468s </code></pre> Version 2 is almost twice as fast. I used the approach below to take the measurements: <pre class="prettyprint"><code>start := time.Now() mult(m1, m2, m3) stop := time.Now() </code></pre> I am aware using slices give another layer of indirection which could impact the cache performance, however I didn't expect it would be such a big difference. Unfortunately I haven't found any good tool, that works with Mac, that can analyse cache efficiency in Go, so I can't say for sure if this is what's causing the performance difference. So I guess I'm asking is this expected behavior or is there something I'm missing? Software and hardware: Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

The main problem in your version 1 code seems to be indirect addressing. Even though the layout in memory for the matrices in both versions is the same, using indirect addressing can lead to: <ul> <li> More generated instructions for the same code. The compiler could have trouble in determining when to use packed versions of SIMD instructions (e.g. SSE, AVX). You can verify this by dumping the assembly code, look for XMM or YMM registers and check if the instructions operating on the registers are packed.</li> <li> You make it difficult for the compiler to add software prefetches. Because indirect addressing, it's difficult for the compiler to detect how to add software prefetches. You can look for vprefetch instructions in the assembly code.</li> <li> The hardware prefetcher will be less efficient also due to indirect addressing. You first need to access the line start address and then access the line elements so it's hard to observe that the hardware prefetcher should just fetch consecutive addresses. That's only measurable through profiling like perf.</li> </ul> So in case of version 1, indirect addressing is the main issue. I also recommend running the 2 codes in multiple iterations to remove the cache warming penalty which might be higher for version 1 because of what I explained above.

Go: Unexpected performance when accessing an array through slice of slices (2D slice)

I was doing some performance experimentation in Go with matrix multiplication and ran into some unexpected results.

Version 1:

func newMatrix(n int) [][]int {
    m := make([][]int, n)
    buf := make([]int, n*n)

    for i := range m {
        m[i] = buf[i*n : (i+1)*n]
    }

    return m
}

func mult1(m1, m2, res [][]int) [][]int {
    for i := range m1 {
        for k := range m1[0] {
            for j := range m2[0] {
                res[i][j] += m1[i][k] * m2[k][j]
            }
        }
    }

    return res
}

From the linear array i create multiple slices that represent the matrix rows.

Version 2:

func mult2(m1, m2, res []int, n int) []int {
    for i := 0; i < n; i++ {
        for k := 0; k < n; k++ {
            for j := 0; j < n; j++ {
                res[i*n+j] += m1[i*n+k] * m2[k*n+j]
            }
        }
    }

    return res
}

In this version I simply use a linear array and index into it from the multiplication.

Multiplying 2 2048x2048 matrices give the following execution time:

 version 1: 35.550813801s
 version 2: 19.090223468s

Version 2 is almost twice as fast.

I used the approach below to take the measurements:

start := time.Now()
mult(m1, m2, m3)
stop := time.Now()

I am aware using slices give another layer of indirection which could impact the cache performance, however I didn't expect it would be such a big difference. Unfortunately I haven't found any good tool, that works with Mac, that can analyse cache efficiency in Go, so I can't say for sure if this is what's causing the performance difference.

So I guess I'm asking is this expected behavior or is there something I'm missing?

Software and hardware: Go version 1.4.2 darwin/amd64; OS X 10.10.3; 2 GHz quad-core i7.

Are arrays faster than slices Golang?

The performance of arrays is better than slices when accessing a single element. The length of the array is part of the type. It has a certain meaning in a specific scenario.

How do you make a 2D array in Go?

To create a 2D array we must specify each dimension. We can then assign individual elements within the array. Here we create a 2 by 2 array of strings. Tip Arrays in Go have fixed sizes.

What is difference between array and slice in Golang?

Slices in Go and Golang The basic difference between a slice and an array is that a slice is a reference to a contiguous segment of an array. Unlike an array, which is a value-type, slice is a reference type. A slice can be a complete array or a part of an array, indicated by the start and end index.

How do I remove slice Go?

Removing an Element from a Slice Unlike other languages, Go does not provide any built-in functions to remove an element from a slice. Items need to be removed from a slice by slicing them out.

The main problem in your version 1 code seems to be indirect addressing. Even though the layout in memory for the matrices in both versions is the same, using indirect addressing can lead to:

More generated instructions for the same code. The compiler could have trouble in determining when to use packed versions of SIMD instructions (e.g. SSE, AVX). You can verify this by dumping the assembly code, look for XMM or YMM registers and check if the instructions operating on the registers are packed.
You make it difficult for the compiler to add software prefetches. Because indirect addressing, it's difficult for the compiler to detect how to add software prefetches. You can look for vprefetch instructions in the assembly code.
The hardware prefetcher will be less efficient also due to indirect addressing. You first need to access the line start address and then access the line elements so it's hard to observe that the hardware prefetcher should just fetch consecutive addresses. That's only measurable through profiling like perf.

So in case of version 1, indirect addressing is the main issue. I also recommend running the 2 codes in multiple iterations to remove the cache warming penalty which might be higher for version 1 because of what I explained above.

Go: Unexpected performance when accessing an array through slice of slices (2D slice)

Tags:

performance

memory-management

multidimensional-array

optimization

go

Carlj901

People also ask

1 Answers

VAndrei

Recent Activity

Donate For Us

Go: Unexpected performance when accessing an array through slice of slices (2D slice)

Tags:

performance

memory-management

multidimensional-array

optimization

go

Carlj901

People also ask

1 Answers

VAndrei

Related questions

Recent Activity

Donate For Us