Optimizing linear access to arrays with pre-fetching and cache in C

Disclosure: I've tried a similar question on Programmers.SE, but that site has nowhere near the activity Stack Overflow has.

Intro

I tend to work with lots of large images. They come in sequences of more than one and have to be processed and played back repeatedly. Sometimes I use the GPU, sometimes the CPU, sometimes both. Most access patterns are linear in nature (back and forth), which got me thinking about more basic questions regarding arrays: how should one write code optimized for the maximum memory bandwidth possible on given hardware (assuming computation isn't blocking reads/writes)?

Test specs

  • I've done this on a 2011 MacBookAir4,2 (i5-2557M) with 4GB RAM and an SSD. Nothing else was running during the tests except iTerm2.
  • gcc 5.2.0 (Homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0, plus additional include, library, and framework flags in order to use the GLFW timer, which I tend to use. I could've done it without; it doesn't matter. All 64-bit, of course.
  • I've tried runs with the optional -fprefetch-loop-arrays flag, but it didn't seem to influence the results at all.

Test

  • Allocate two arrays of uint64_t on the heap, totaling n MB each - where n is 8, 16, 32, 64, 128, 256, 512 and 1024
  • Initialize the source array to 0xff, one element at a time
  • Test 1 - linear copy

linear copy:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
  • Test 2 - copying with stride. This is where it gets confusing. I have tried playing the pre-fetch game here, trying various amounts of work per loop iteration, and ~40 elements per iteration seems to yield the best performance. Why? I have no idea. I understand that malloc in C99 with uint64_t gives me a properly aligned block. I also see the sizes of my L1 through L3 caches, which are all larger than these 320 bytes, so what am I hitting? Clues might be in the graphs later. I would really like to understand this. (A sketch of an explicit-prefetch variant follows the loop below.)

stride copy:

for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
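
Since much of this is about playing the pre-fetch game, here's a minimal sketch of what an explicit software-prefetch variant of the copy could look like. The prefetch distance (PF_DIST, 512 elements here) is a made-up starting point that would need tuning per machine, and n is assumed to be a multiple of 8:

prefetch copy (sketch):

#include <stdint.h>

#define PF_DIST 512 /* hypothetical prefetch distance in elements; tune per machine */

void copy_with_prefetch(uint64_t *dst, const uint64_t *src, uint64_t n)
{
    for(uint64_t i = 0; i < n; i += 8) {
        if(i + PF_DIST < n) {
            /* __builtin_prefetch(addr, rw, locality): rw 0 = read, 1 = write;
               locality 0 = no temporal reuse expected */
            __builtin_prefetch(&src[i + PF_DIST], 0, 0);
            __builtin_prefetch(&dst[i + PF_DIST], 1, 0);
        }
        dst[i]   = src[i];
        dst[i+1] = src[i+1];
        dst[i+2] = src[i+2];
        dst[i+3] = src[i+3];
        dst[i+4] = src[i+4];
        dst[i+5] = src[i+5];
        dst[i+6] = src[i+6];
        dst[i+7] = src[i+7];
    }
}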
  • Test 3 - reading with stride. Same as with copying with stride.

stride read:

    const int imax = 1000;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
        /* timing and min/max bookkeeping omitted here; see the full source below */
    }
  • Test 4 - linear reading, one element at a time. I was surprised that -fprefetch-loop-arrays yielded no results here; I thought it was meant for exactly these cases. (It presumably had no effect because such optimization flags generally only kick in when -O1 or higher is enabled.)

linear reading:

for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
  • Test 5 - memcpy as a contrast.

memcpy:

memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));

Results

  • Sample output:

sample output:

Init done in 0.767 s - size of array: 1024 MBs (x2)
Performance: 1304.325 MB/s

Copying (linear) done in 0.898 s
Performance: 1113.529 MB/s

Copying (stride 40) done in 0.257 s
Performance: 3890.608 MB/s

[1000/1000] Performance stride 40: 7474.322 MB/s
Average: 7523.427 MB/s
Performance MIN: 3231 MB/s | Performance MAX: 7818 MB/s

[1000/1000] Performance dumb: 2504.713 MB/s
Average: 2481.502 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2644 MB/s

Copying (memcpy) done in 1.726 s
Performance: 579.485 MB/s

--

Init done in 0.415 s - size of array: 512 MBs (x2)
Performance: 1233.136 MB/s

Copying (linear) done in 0.442 s
Performance: 1157.147 MB/s

Copying (stride 40) done in 0.116 s
Performance: 4399.606 MB/s

[1000/1000] Performance stride 40: 6527.004 MB/s
Average: 7166.458 MB/s
Performance MIN: 4359 MB/s | Performance MAX: 7787 MB/s

[1000/1000] Performance dumb: 2383.292 MB/s
Average: 2409.005 MB/s
Performance MIN: 1673 MB/s | Performance MAX: 2641 MB/s

Copying (memcpy) done in 0.102 s
Performance: 5026.476 MB/s

--

Init done in 0.228 s - size of array: 256 MBs (x2)
Performance: 1124.618 MB/s

Copying (linear) done in 0.242 s
Performance: 1057.916 MB/s

Copying (stride 40) done in 0.070 s
Performance: 3650.996 MB/s

[1000/1000] Performance stride 40: 7129.206 MB/s
Average: 7370.537 MB/s
Performance MIN: 4805 MB/s | Performance MAX: 7848 MB/s

[1000/1000] Performance dumb: 2456.129 MB/s
Average: 2435.556 MB/s
Performance MIN: 1496 MB/s | Performance MAX: 2637 MB/s

Copying (memcpy) done in 0.050 s
Performance: 5095.845 MB/s

-- 

Init done in 0.100 s - size of array: 128 MBs (x2)
Performance: 1277.200 MB/s

Copying (linear) done in 0.112 s
Performance: 1147.030 MB/s

Copying (stride 40) done in 0.029 s
Performance: 4424.513 MB/s

[1000/1000] Performance stride 40: 6497.635 MB/s
Average: 6714.540 MB/s
Performance MIN: 4206 MB/s | Performance MAX: 7843 MB/s

[1000/1000] Performance dumb: 2275.336 MB/s
Average: 2335.544 MB/s
Performance MIN: 1572 MB/s | Performance MAX: 2626 MB/s

Copying (memcpy) done in 0.025 s
Performance: 5086.502 MB/s

-- 

Init done in 0.051 s - size of array: 64 MBs (x2)
Performance: 1255.969 MB/s

Copying (linear) done in 0.058 s
Performance: 1104.282 MB/s

Copying (stride 40) done in 0.015 s
Performance: 4305.765 MB/s

[1000/1000] Performance stride 40: 7750.063 MB/s
Average: 7412.167 MB/s
Performance MIN: 3892 MB/s | Performance MAX: 7826 MB/s

[1000/1000] Performance dumb: 2610.136 MB/s
Average: 2577.313 MB/s
Performance MIN: 2126 MB/s | Performance MAX: 2652 MB/s

Copying (memcpy) done in 0.013 s
Performance: 4871.823 MB/s

-- 

Init done in 0.024 s - size of array: 32 MBs (x2)
Performance: 1306.738 MB/s

Copying (linear) done in 0.028 s
Performance: 1148.582 MB/s

Copying (stride 40) done in 0.008 s
Performance: 4265.907 MB/s

[1000/1000] Performance stride 40: 6181.040 MB/s
Average: 7124.592 MB/s
Performance MIN: 3480 MB/s | Performance MAX: 7777 MB/s

[1000/1000] Performance dumb: 2508.669 MB/s
Average: 2556.529 MB/s
Performance MIN: 1966 MB/s | Performance MAX: 2646 MB/s

Copying (memcpy) done in 0.007 s
Performance: 4617.860 MB/s

--

Init done in 0.013 s - size of array: 16 MBs (x2)
Performance: 1243.011 MB/s

Copying (linear) done in 0.014 s
Performance: 1139.362 MB/s

Copying (stride 40) done in 0.004 s
Performance: 4181.548 MB/s

[1000/1000] Performance stride 40: 6317.129 MB/s
Average: 7358.539 MB/s
Performance MIN: 5250 MB/s | Performance MAX: 7816 MB/s

[1000/1000] Performance dumb: 2529.707 MB/s
Average: 2525.783 MB/s
Performance MIN: 1823 MB/s | Performance MAX: 2634 MB/s

Copying (memcpy) done in 0.003 s
Performance: 5167.561 MB/s

--

Init done in 0.007 s - size of array: 8 MBs (x2)
Performance: 1186.019 MB/s

Copying (linear) done in 0.007 s
Performance: 1147.018 MB/s

Copying (stride 40) done in 0.002 s
Performance: 4157.658 MB/s

[1000/1000] Performance stride 40: 6958.839 MB/s
Average: 7097.742 MB/s
Performance MIN: 4278 MB/s | Performance MAX: 7499 MB/s

[1000/1000] Performance dumb: 2585.366 MB/s
Average: 2537.896 MB/s
Performance MIN: 2284 MB/s | Performance MAX: 2610 MB/s

Copying (memcpy) done in 0.002 s
Performance: 5059.164 MB/s
  • Linear reading is 3 times slower than reading in strides. Strided reading maxes out in the approx. 7500-7800 MB/s range. Two things confuse me, though. At DDR3 1333 MHz, max memory throughput should be 10,664 MB/s, so why am I not hitting it? And why isn't the reading speed more consistent, and how would I optimize for that (cache misses?)? It's more apparent from the graphs, especially the linear reading with its regular dips in performance.

Graphs

[Graph: 8-16 MB]

[Graph: 32-64 MB]

[Graph: 128-256 MB]

[Graph: 512-1024 MB]

[Graph: all sizes together]

Here's the full source for anyone interested:

/*
gcc -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0 -I "...path to glfw3 includes ..." -L "...path to glfw3 lib ..." arr_test_copy_gnuplot.c -o arr_test_copy_gnuplot -lglfw3 -framework OpenGL -framework Cocoa -framework IOKit -framework CoreVideo

optional: -fprefetch-loop-arrays
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* memcpy */
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM 1000000 * 128 /* GIG */
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    int cx = 0;
    char filename_stride[50];
    char filename_dumb[50];
    cx = snprintf(filename_stride, 50, "%lu_stride.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >= 50) { exit(EXIT_FAILURE); } /* snprintf error or truncation */
    FILE *file_stride = fopen(filename_stride, "w");
    cx = snprintf(filename_dumb, 50, "%lu_dumb.dat", 
                    ((ARRAY_NUM*sizeof(uint64_t))/1000000));
    if(cx < 0 || cx >= 50) { exit(EXIT_FAILURE); } /* snprintf error or truncation */
    FILE *file_dumb   = fopen(filename_dumb, "w");
    if(file_stride == NULL || file_dumb == NULL) {
        perror("Error opening file.");
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);
    if(array == NULL || array_copy == NULL) {
        perror("malloc failed");
        exit(EXIT_FAILURE);
    }

    double performance  = 0.0;
    double time_start   = 0.0;
    double time_end     = 0.0;
    double performance_min  = 0.0;
    double performance_max  = 0.0;

    /* Init array */
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %.3f s - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Linear copy */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Copying with wide stride */
    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
            array_copy[i] = array[i];
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
            array_copy[i+8] = array[i+8];
            array_copy[i+9] = array[i+9];
            array_copy[i+10] = array[i+10];
            array_copy[i+11] = array[i+11];
            array_copy[i+12] = array[i+12];
            array_copy[i+13] = array[i+13];
            array_copy[i+14] = array[i+14];
            array_copy[i+15] = array[i+15];
            array_copy[i+16] = array[i+16];
            array_copy[i+17] = array[i+17];
            array_copy[i+18] = array[i+18];
            array_copy[i+19] = array[i+19];
            array_copy[i+20] = array[i+20];
            array_copy[i+21] = array[i+21];
            array_copy[i+22] = array[i+22];
            array_copy[i+23] = array[i+23];
            array_copy[i+24] = array[i+24];
            array_copy[i+25] = array[i+25];
            array_copy[i+26] = array[i+26];
            array_copy[i+27] = array[i+27];
            array_copy[i+28] = array[i+28];
            array_copy[i+29] = array[i+29];
            array_copy[i+30] = array[i+30];
            array_copy[i+31] = array[i+31];
            array_copy[i+32] = array[i+32];
            array_copy[i+33] = array[i+33];
            array_copy[i+34] = array[i+34];
            array_copy[i+35] = array[i+35];
            array_copy[i+36] = array[i+36];
            array_copy[i+37] = array[i+37];
            array_copy[i+38] = array[i+38];
            array_copy[i+39] = array[i+39];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 40) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n\n", performance);

    /* Reading with wide stride */
    const int imax = 1000;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+40) {
                tmp = array[i];
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
                tmp = array[i+8];
                tmp = array[i+9];
                tmp = array[i+10];
                tmp = array[i+11];
                tmp = array[i+12];
                tmp = array[i+13];
                tmp = array[i+14];
                tmp = array[i+15];
                tmp = array[i+16];
                tmp = array[i+17];
                tmp = array[i+18];
                tmp = array[i+19];
                tmp = array[i+20];
                tmp = array[i+21];
                tmp = array[i+22];
                tmp = array[i+23];
                tmp = array[i+24];
                tmp = array[i+25];
                tmp = array[i+26];
                tmp = array[i+27];
                tmp = array[i+28];
                tmp = array[i+29];
                tmp = array[i+30];
                tmp = array[i+31];
                tmp = array[i+32];
                tmp = array[i+33];
                tmp = array[i+34];
                tmp = array[i+35];
                tmp = array[i+36];
                tmp = array[i+37];
                tmp = array[i+38];
                tmp = array[i+39];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }

        printf("[%d/%d] Performance stride 40: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_stride, "%d\t%f\n", j, performance);
        fflush(file_stride);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Linear reading */
    performance_average = 0.0;
    performance_min     = 0.0;
    performance_max      = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        if(performance > performance_max) { performance_max = performance; }
        if(j == 0) { performance_min = performance; }
        if(performance < performance_min) { performance_min = performance; }
        printf("[%d/%d] Performance dumb: %.3f MB/s\r", j+1, imax, performance);
        fprintf(file_dumb, "%d\t%f\n", j, performance);
        fflush(file_dumb);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %.3f MB/s\n", performance_average);
    printf("Performance MIN: %3.f MB/s | Performance MAX: %3.f MB/s\n\n", 
            performance_min, performance_max);

    /* Memcpy */
    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %.3f s\n", (time_end - time_start));
    printf("Performance: %.3f MB/s\n", performance);

    /* Cleanup and exit */
    free(array);
    free(array_copy);
    glfwTerminate();
    fclose(file_dumb);
    fclose(file_stride);

    exit(EXIT_SUCCESS);
}

Summary

  • How should I write code to get maximum and (near-)constant speed when working with arrays where linear access is the most common pattern?
  • What can I learn about cache and pre-fetching from this example?
  • Are these graphs telling me something I should know that I haven't noticed?
  • How else can I unroll loops? I have tried -funroll-loops to no effect, so I have resorted to manually writing loop-in-loop unrolls (a sketch of that shape follows this list).
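
For reference, this is the loop-in-loop shape I mean, as a sketch (it assumes n is a multiple of UNROLL; at -O2 and above a compiler will typically unroll the inner loop fully since its trip count is a compile-time constant, but at -O0 it stays a real loop, which is presumably why the flat manual unroll wins in unoptimized builds):

loop-in-loop unroll (sketch):

#include <stdint.h>

#define UNROLL 8 /* arbitrary unroll factor for illustration */

void copy_unrolled(uint64_t *dst, const uint64_t *src, uint64_t n)
{
    for(uint64_t i = 0; i < n; i += UNROLL) {
        for(uint64_t k = 0; k < UNROLL; ++k) {
            dst[i + k] = src[i + k];
        }
    }
}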

Thanks for the long read.

EDIT:

It seems -O0 gives different performance than when the -O flag is absent entirely! What gives? With the flag absent, performance is better, as can be seen in the graph. This is puzzling, since -O0 is supposed to be gcc's default when no -O flag is given, so any difference should just be run-to-run variance.

[Graph: results with the -O flag absent]

EDIT2:

I have finally hit the ceiling with AVX.

=== READING WITH AVX ===
[1000/1000] Performance AVX: 9868.912 MB/s
Average: 10029.085 MB/s
Performance MIN: 6554 MB/s | Performance MAX: 11464 MB/s

The average is really close to 10664. I had to switch compilers to clang, because gcc was giving me a hard time about using AVX (-mavx). This is also why the graph has more pronounced dips. I would still like to know how to get constant performance, and what causes the variation. I presume this is due to caching / cache lines. It would also explain performance going above the DDR3 speed here and there (MAX was 11464 MB/s).
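
For reference, a minimal sketch of what such an AVX read loop could look like (not my exact code; it assumes the array is 32-byte aligned, e.g. allocated with _mm_malloc(size, 32), and that n is a multiple of 4; the bitwise OR through the pd domain is just there because plain AVX lacks 256-bit integer ops, and to stop the loads from being optimized away):

AVX read (sketch):

#include <stdint.h>
#include <immintrin.h> /* AVX; compile with -mavx */

uint64_t read_avx(const uint64_t *array, uint64_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for(uint64_t i = 0; i < n; i += 4) {
        /* aligned 256-bit (4 x uint64_t) load */
        __m256i v = _mm256_load_si256((const __m256i *)&array[i]);
        acc = _mm256_or_pd(acc, _mm256_castsi256_pd(v));
    }
    uint64_t out[4];
    _mm256_storeu_pd((double *)out, acc);
    return out[0] | out[1] | out[2] | out[3];
}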

Excuse my gnuplot-fu and its keys. Blue is SSE2 ( _mm_load_si128 ) and orange is AVX ( _mm256_load_si256 ). Purple is strided reading as before, and green is dumb reading, one element at a time.

[Graph: King AVX]

So, final two questions are:

  • What is causing the dips, and how can I get more constant performance?
  • Is it possible to hit the ceiling without intrinsics?

gist with latest version: https://gist.github.com/Keyframe/1ed9062ec52fc4a0d14b and graphs from that version: http://imgur.com/a/cPeor

asked Nov 12 '15 by Keyframe


3 Answers

Your value for the peak bandwidth from main memory is off by a factor of two. Instead of 10664 MB/s, it should be 21.3 GB/s (more precisely, 21333⅓ MB/s - see my derivation below). The fact that you sometimes see more than 10664 MB/s should have told you that maybe there was a problem in your peak-bandwidth calculation.

In order to get the maximum bandwidth for Core2 through Sandy Bridge you need to use non-temporal stores. Additionally, you need multiple threads. You don't need AVX instructions or to unroll the loop.

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        /* aligned 16-byte load from x, then a non-temporal (streaming)
           store to y that bypasses the cache */
        _mm_stream_ps((float*)&y[16*i], _mm_load_ps((float*)&x[16*i]));
    }
}

The arrays need to be 16-byte aligned and their size must be a multiple of 16 bytes. The rule of thumb for non-temporal stores is to use them when the region you are copying is larger than half the size of the last-level cache. In your case, half the L3 cache size is 1.5 MB and the smallest array you copy is 8 MB, so it is much larger than half the last-level cache size.

Here is some code to test this.

//gcc -O3 -fopenmp foo.c
#include <stdio.h>
#include <x86intrin.h>
#include <string.h>
#include <omp.h>

void copy(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_stream_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

void copy2(char *x, char *y, int n)
{
    #pragma omp parallel for schedule(static)
    for(int i=0; i<n/16; i++)
    {
        _mm_store_ps((float*)&x[16*i], _mm_load_ps((float*)&y[16*i]));
    }
}

int main(void)
{
    unsigned n = 0x7fffffff;
    char *x = _mm_malloc(n, 16);
    char *y = _mm_malloc(n, 16);
    double dtime;

    memset(x,0,n);
    memset(y,1,n);

    dtime = -omp_get_wtime();
    copy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    copy2(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    memcpy(x,y,n);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);  
}

On my system, a Core2 machine (before Nehalem), it gives

time non temporal store 0.39
time SSE store          1.10
time memcpy             0.98

to copy 2GB.

Note that it's very important that you "touch" the memory you will write to first (I used memset to do this). Your system does not necessarily allocate your memory until you access it. The overhead of this can bias your results significantly if the memory has not been accessed before you do the memory copy.
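
To see that first-touch cost in isolation, here is a small sketch (using omp_get_wtime for timing, as above) that times a first pass over freshly allocated memory against a second pass over the same memory:

//gcc -O2 -fopenmp touch.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(void)
{
    size_t n = (size_t)1 << 30; /* 1 GB */
    char *p = malloc(n);
    if(p == NULL) return 1;

    double t = -omp_get_wtime();
    memset(p, 1, n); /* first touch: pays for page allocation / zero-fill */
    t += omp_get_wtime();
    printf("first touch:  %f s\n", t);

    t = -omp_get_wtime();
    memset(p, 2, n); /* pages already mapped: much faster */
    t += omp_get_wtime();
    printf("second pass:  %f s\n", t);

    free(p);
    return 0;
}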


According to Wikipedia, DDR3-1333 has a memory clock of 166⅔ MHz. DDR transfers data at twice the memory clock rate. Additionally, DDR3 has a bus clock multiplier of four, so DDR3 performs eight transfers per memory clock. Additionally, your motherboard has two memory channels. So the total transfer rate is

 21333⅓ MB/s = (166⅔ × 1E6 clocks/s) × (8 lines/clock/channel) × (2 channels) × (64 bits/line) × (1 byte/8 bits) × (1 MB/1E6 bytes)
answered by Z boson


For the things you are doing, I would look at SIMD (Single Instruction, Multiple Data); google "GCC compiler intrinsics" for details.
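
For example, a minimal SSE2 copy using intrinsics could look like this (a sketch: both pointers must be 16-byte aligned and n must be a multiple of 16):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */

void copy_sse2(uint8_t *dst, const uint8_t *src, size_t n)
{
    for(size_t i = 0; i < n; i += 16) {
        /* aligned 128-bit load and store */
        __m128i v = _mm_load_si128((const __m128i *)&src[i]);
        _mm_store_si128((__m128i *)&dst[i], v);
    }
}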

answered by Marek Waszkiewicz


You should compile with a recent GCC (so having compiled your GCC 5.2 is a good idea, in November 2015), and you should enable optimizations for your particular platform, so I suggest compiling with gcc -Wall -O2 -march=native at least (also try replacing -O2 with -O3).

(Don't benchmark your programs without enabling optimizations in your compiler.)

If you are concerned with cache effects, you might play with __builtin_prefetch, but see this.

Read also about OpenMP, OpenCL, OpenACC.

answered by Basile Starynkevitch