 

Fast(est) way to write a sequence of integers to global memory?

The task is very simple: writing a sequence of integer values to memory:

Original code:

for (size_t i=0; i<1000*1000*1000; ++i)
{
    data[i]=i;
}

Parallelized code:

    size_t stepsize=len/N;

#pragma omp parallel num_threads(N)
    {
        int threadIdx=omp_get_thread_num();

        size_t istart=stepsize*threadIdx;
        size_t iend=threadIdx==N-1?len:istart+stepsize;  // last thread takes the remainder
#pragma simd
        for (size_t i=istart; i<iend; ++i)
            data[i]=i;
    }

The performance sucks: it takes 1.6 s to write 1G uint64 values (which equals 5 GB/s). With simple parallelization (OpenMP parallel) of the above code the speed increases a bit, but performance still sucks: 1.4 s with 4 threads and 1.35 s with 6 threads on an i7 3970.

The theoretical memory bandwidth of my rig (i7 3970 / 64 GB DDR3-1600) is 51.2 GB/s, so for the above example the achieved memory bandwidth is only about 1/10 of the theoretical bandwidth, even though the application is pretty much memory-bandwidth-bound.

Does anyone know how to improve the code?

I have written a lot of memory-bound code on GPUs, and there it is pretty easy to take full advantage of the GPU's device memory bandwidth (e.g. 85%+ of the theoretical bandwidth).

EDIT:

The code is compiled with Intel ICC 13.1 to a 64-bit binary, with maximum optimization (-O3), the AVX code path, and auto-vectorization enabled.

UPDATE:

I tried all the code below (thanks to Paul R); nothing special happened. I believe the compiler is fully capable of doing that kind of SIMD/vectorization optimization on its own.

As for why I want to fill the numbers there, well, long story short:

It's part of a high-performance heterogeneous computing algorithm. The device side is so efficient that the multi-GPU set-up is very fast, and the performance bottleneck turns out to be the point where the CPU has to write several sequences of numbers to memory.

Of course, knowing that the CPU sucks at filling numbers (in contrast, the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory: 238 GB/s out of 288 GB/s on GK110, versus a pathetic 5 GB/s out of 51.2 GB/s on the CPU), I could change my algorithm a bit, but what makes me wonder is why the CPU sucks so badly at filling a sequence of numbers here.

As for the memory bandwidth of my rig, I believe the figure (51.2 GB/s) is about correct: based on my memcpy() test, the achieved bandwidth is about 80%+ of the theoretical bandwidth (>40 GB/s).
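
(The memcpy() test itself isn't shown here; a minimal sketch of what such a test could look like is below. The buffer size is an arbitrary choice, omp_get_wtime() is just used as a convenient wall-clock timer, and the copy is counted as both a read and a write.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>                         /* omp_get_wtime() wall-clock timer */

int main(void)
{
    const size_t bytes = 1ULL << 31;     /* 2 GB per buffer, arbitrary */
    char *src = (char *)malloc(bytes);
    char *dst = (char *)malloc(bytes);
    if (!src || !dst) return 1;

    memset(src, 1, bytes);               /* touch the pages so they are mapped */
    memset(dst, 0, bytes);

    double t0 = omp_get_wtime();
    memcpy(dst, src, bytes);
    double t = omp_get_wtime() - t0;

    /* memcpy reads 'bytes' and writes 'bytes', so count both directions */
    printf("memcpy: %.2f GB/s\n", 2.0 * bytes / t / 1e9);

    free(src);
    free(dst);
    return 0;
}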

asked Aug 23 '13 by user2188453


2 Answers

Assuming this is x86, and that you are not already saturating your available DRAM bandwidth, you can try using SSE2 or AVX2 to write 2 or 4 elements at a time:

SSE2:

#include "emmintrin.h"

const __m128i v2 = _mm_set1_epi64x(2);
__m128i v = _mm_set_epi64x(1, 0);

for (size_t i=0; i<1000*1000*1000; i += 2)
{
    _mm_stream_si128((__m128i *)&data[i], v);
    v = _mm_add_epi64(v, v2);
}

AVX2:

#include "immintrin.h"

const __m256i v4 = _mm256_set1_epi64x(4);
__m256i v = _mm256_set_epi64x(3, 2, 1, 0);

for (size_t i=0; i<1000*1000*1000; i += 4)
{
    _mm256_stream_si256((__m256i *)&data[i], v);
    v = _mm256_add_epi64(v, v4);
}

Note that data needs to be suitably aligned (to a 16 byte boundary for the SSE2 version, or a 32 byte boundary for the AVX2 version), since the streaming stores require it.
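
For example, one way to get a suitably aligned buffer is the Intel-intrinsics allocator (a sketch; C11 aligned_alloc or posix_memalign would work just as well):

#include <stdint.h>
#include <immintrin.h>                   /* _mm_malloc / _mm_free */

const size_t n = 1000*1000*1000;
uint64_t *data = (uint64_t *)_mm_malloc(n * sizeof(uint64_t), 32);  /* 32-byte aligned: OK for SSE2 and AVX2 */
/* ... run one of the fill loops above ... */
_mm_free(data);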

AVX2 is only available on Intel Haswell and later, but SSE2 is pretty much universal these days.
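
If you need to pick between the two paths at run time, GCC and Clang provide a builtin for the feature check (a small sketch; ICC or MSVC would need a CPUID-based check instead):

#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_supports() is a GCC/Clang extension (GCC >= 4.8) */
    if (__builtin_cpu_supports("avx2"))
        printf("AVX2 path available\n");
    else
        printf("falling back to SSE2 path\n");
    return 0;
}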


FWIW I put together a test harness with a scalar loop and the above SSE and AVX loops, compiled it with clang, and tested it on a Haswell MacBook Air (1600 MHz LPDDR3 DRAM). I got the following results:

# sequence_scalar: t = 0.870903 s = 8.76033 GB / s
# sequence_SSE: t = 0.429768 s = 17.7524 GB / s
# sequence_AVX: t = 0.431182 s = 17.6941 GB / s

I also tried it on a Linux desktop PC with a 3.6 GHz Haswell, compiling with gcc 4.7.2, and got the following:

# sequence_scalar: t = 0.816692 s = 9.34183 GB / s
# sequence_SSE: t = 0.39286 s = 19.4201 GB / s
# sequence_AVX: t = 0.392545 s = 19.4357 GB / s

So it looks like the SIMD implementations give a 2x or greater improvement over 64-bit scalar code (although 256-bit SIMD doesn't seem to give any improvement over 128-bit SIMD), and typical throughput should be a lot faster than 5 GB/s.

My guess is that there is something wrong with the OP's system or benchmarking code which is resulting in an apparently reduced throughput.
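
If you want to sanity-check your own numbers, a minimal scalar-only version of such a harness might look something like this (omp_get_wtime() is just used here as a convenient wall-clock timer):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>                                  /* omp_get_wtime() */

int main(void)
{
    const size_t n = 1000*1000*1000;
    uint64_t *data = (uint64_t *)malloc(n * sizeof(uint64_t));
    if (!data) return 1;

    double t0 = omp_get_wtime();
    for (size_t i = 0; i < n; ++i)
        data[i] = i;
    double t = omp_get_wtime() - t0;

    /* read one element back so the stores cannot be optimised away */
    printf("t = %g s, %g GB/s (last = %llu)\n",
           t, n * sizeof(uint64_t) / t / 1e9,
           (unsigned long long)data[n - 1]);

    free(data);
    return 0;
}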

answered Nov 01 '22 by Paul R


Is there any reason why you would expect all of data[] to be in powered-up RAM pages?

The DDR3 prefetcher will correctly predict most accesses, but the frequent x86-64 page boundaries might be an issue. You're writing to virtual memory, so at each page boundary there's a potential mis-prediction of the prefetcher. You can greatly reduce this by using large pages (e.g. MEM_LARGE_PAGES on Windows).
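
On Linux, the rough equivalent is transparent huge pages; a minimal sketch (assuming the kernel has THP support enabled) would be:

#include <stdint.h>
#include <sys/mman.h>

size_t bytes = 1000ULL*1000*1000 * sizeof(uint64_t);
uint64_t *data = (uint64_t *)mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (data != MAP_FAILED)
    madvise(data, bytes, MADV_HUGEPAGE);   /* 2 MB pages => ~512x fewer page boundaries */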

answered Nov 01 '22 by MSalters