Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?

Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling of a 64-bit C program.

In particular, I'd like to find the fastest way to test if a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment.

I am using a PC with an Intel Ivy Bridge Core i7 3770 processor with 32 GB of memory and the free version of the Microsoft Visual Studio 2010 C compiler.

My first attempt was:

const char* bytevecM;    // 4 GB block of memory, 64-byte aligned
size_t* psz;             // size_t is 64-bits
// ...
// "m7 & 0xffffff80" selects the 128-byte block to test for all zeros
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0]  == 0 && psz[1]  == 0
&&  psz[2]  == 0 && psz[3]  == 0
&&  psz[4]  == 0 && psz[5]  == 0
&&  psz[6]  == 0 && psz[7]  == 0
&&  psz[8]  == 0 && psz[9]  == 0
&&  psz[10] == 0 && psz[11] == 0
&&  psz[12] == 0 && psz[13] == 0
&&  psz[14] == 0 && psz[15] == 0) continue;
// ...

VTune profiling of the corresponding assembly follows:

cmp    qword ptr [rax],      0x0       0.171s
jnz    0x14000222                     42.426s
cmp    qword ptr [rax+0x8],  0x0       0.498s
jnz    0x14000222                      0.358s
cmp    qword ptr [rax+0x10], 0x0       0.124s
jnz    0x14000222                      0.031s
cmp    qword ptr [rax+0x18], 0x0       0.171s
jnz    0x14000222                      0.031s
cmp    qword ptr [rax+0x20], 0x0       0.233s
jnz    0x14000222                      0.560s
cmp    qword ptr [rax+0x28], 0x0       0.498s
jnz    0x14000222                      0.358s
cmp    qword ptr [rax+0x30], 0x0       0.140s
jnz    0x14000222
cmp    qword ptr [rax+0x38], 0x0       0.124s
jnz    0x14000222
cmp    qword ptr [rax+0x40], 0x0       0.156s
jnz    0x14000222                      2.550s
cmp    qword ptr [rax+0x48], 0x0       0.109s
jnz    0x14000222                      0.124s
cmp    qword ptr [rax+0x50], 0x0       0.078s
jnz    0x14000222                      0.016s
cmp    qword ptr [rax+0x58], 0x0       0.078s
jnz    0x14000222                      0.062s
cmp    qword ptr [rax+0x60], 0x0       0.093s
jnz    0x14000222                      0.467s
cmp    qword ptr [rax+0x68], 0x0       0.047s
jnz    0x14000222                      0.016s
cmp    qword ptr [rax+0x70], 0x0       0.109s
jnz    0x14000222                      0.047s
cmp    qword ptr [rax+0x78], 0x0       0.093s
jnz    0x14000222                      0.016s

I was able to improve on that via Intel intrinsics:

const char* bytevecM;                        // 4 GB block of memory
__m128i* psz;                                // __m128i is 128-bits
__m128i one = _mm_set1_epi32(0xffffffff);    // all bits one
// ...
psz = (__m128i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm_testz_si128(psz[0], one) && _mm_testz_si128(psz[1], one)
&&  _mm_testz_si128(psz[2], one) && _mm_testz_si128(psz[3], one)
&&  _mm_testz_si128(psz[4], one) && _mm_testz_si128(psz[5], one)
&&  _mm_testz_si128(psz[6], one) && _mm_testz_si128(psz[7], one)) continue;
// ...

VTune profiling of the corresponding assembly follows:

movdqa xmm0, xmmword ptr [rax]         0.218s
ptest  xmm0, xmm2                     35.425s
jnz    0x14000ddd                      0.700s
movdqa xmm0, xmmword ptr [rax+0x10]    0.124s
ptest  xmm0, xmm2                      0.078s
jnz    0x14000ddd                      0.218s
movdqa xmm0, xmmword ptr [rax+0x20]    0.155s
ptest  xmm0, xmm2                      0.498s
jnz    0x14000ddd                      0.296s
movdqa xmm0, xmmword ptr [rax+0x30]    0.187s
ptest  xmm0, xmm2                      0.031s
jnz    0x14000ddd
movdqa xmm0, xmmword ptr [rax+0x40]    0.093s
ptest  xmm0, xmm2                      2.162s
jnz    0x14000ddd                      0.280s
movdqa xmm0, xmmword ptr [rax+0x50]    0.109s
ptest  xmm0, xmm2                      0.031s
jnz    0x14000ddd                      0.124s
movdqa xmm0, xmmword ptr [rax+0x60]    0.109s
ptest  xmm0, xmm2                      0.404s
jnz    0x14000ddd                      0.124s
movdqa xmm0, xmmword ptr [rax+0x70]    0.093s
ptest  xmm0, xmm2                      0.078s
jnz    0x14000ddd                      0.016s

As you can see, there are fewer assembly instructions, and this version also proved faster in timing tests.

Since I am quite weak in the area of Intel SSE/AVX instructions, I welcome advice on how they might be better employed to speed up this code.

Though I scoured the hundreds of available intrinsics, I may have missed the ideal ones. In particular, I was unable to effectively employ _mm_cmpeq_epi64(); I looked for a "not equal" version of this intrinsic (which seems better suited to this problem) but came up dry. Though the code below "works":

if (_mm_testz_si128(
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[7],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[6],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[5],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[4],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[3],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[2],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[1],
    _mm_andnot_si128(_mm_cmpeq_epi64(psz[0], zero),
    one)), one)), one)), one)), one)), one)), one)), one),
    one)) continue;

it is borderline unreadable and (unsurprisingly) proved to be way slower than the two versions given above. I feel sure there must be a more elegant way to employ _mm_cmpeq_epi64() and welcome advice on how that might be achieved.
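For reference, one more readable shape such a test could take (a sketch only, not benchmarked here) is to AND the per-chunk comparison masks together and test the combined mask once at the end:

// Sketch (untested): AND together the per-chunk "== 0" masks, then test once.
// _mm_cmpeq_epi64 (SSE4.1, <smmintrin.h>) sets each 64-bit lane to all-ones
// iff the corresponding lanes are equal, so m stays all-ones only if every
// lane of every chunk equals zero.
__m128i zero = _mm_setzero_si128();
__m128i m = _mm_cmpeq_epi64(psz[0], zero);
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[1], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[2], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[3], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[4], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[5], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[6], zero));
m = _mm_and_si128(m, _mm_cmpeq_epi64(psz[7], zero));
if (_mm_movemask_epi8(m) == 0xFFFF) continue;    // all 128 bytes are zero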

In addition to using intrinsics from C, raw Intel assembly language solutions to this problem are also welcome.

asked Mar 02 '13 by eyepopslikeamosquito


3 Answers

The main problem, as others have pointed out, is that the 128-byte block you are checking is missing the data cache and/or the TLB and going out to DRAM, which is slow. VTune is telling you this:

cmp    qword ptr [rax],      0x0       0.171s
jnz    0x14000222                     42.426s

You have another, smaller hotspot half-way down:

cmp    qword ptr [rax+0x40], 0x0       0.156s
jnz    0x14000222                      2.550s

Those 42.4 + 2.5 seconds attributed to the jnz instructions are really a stall caused by the preceding load from memory: the processor sat around doing nothing for about 45 seconds in total over the profiled run, waiting on DRAM.

You might ask what the second hotspot half-way down is about. You are accessing 128 bytes, but cache lines are 64 bytes. The CPU started prefetching the second line for you as soon as it read the first 64 bytes, but the checks on the first 64 bytes were not enough work to fully overlap the latency of going to memory.

The memory bandwidth of Ivy Bridge is very high (it depends on your system, but I'd guess over 10 GB/sec). Your block of memory is 4 GB, so if you access it sequentially and let the CPU prefetch ahead for you, you should be able to zip through it in under a second (4 GB at 10 GB/sec is roughly 0.4 seconds).

My guess is you are thwarting the CPU data prefetcher by accessing the 128-byte blocks in a non-contiguous fashion.

Change your access pattern to be sequential and you'll be surprised how much faster it runs. You can then worry about the next level of optimization, which will be making sure the branch prediction works well.
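A minimal sketch of that sequential pass (is_zero_block and handle_nonzero are hypothetical placeholders for whichever 128-byte test and follow-up work you use):

// Sketch: walk the 4 GB buffer in address order so the hardware prefetcher
// can stream it; is_zero_block/handle_nonzero are hypothetical placeholders.
size_t off;
for (off = 0; off < 4ULL * 1024 * 1024 * 1024; off += 128) {
    if (!is_zero_block(bytevecM + off))
        handle_nonzero(off);
}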

Another thing to consider is TLB misses. Those are costly, especially in a 64-bit system. Rather than using 4 KB pages, consider using 2 MB large pages. See this link for Windows support for them: Large-Page Support (Windows)
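A minimal sketch of such an allocation (assumes large-page support; the account also needs SeLockMemoryPrivilege, as the linked page explains):

#include <windows.h>

// Sketch: back the 4 GB buffer with large (typically 2 MB) pages.
char*  bytevecM = NULL;
SIZE_T large = GetLargePageMinimum();              // 0 if not supported
SIZE_T size  = 4ULL * 1024 * 1024 * 1024;
if (large != 0) {
    size = (size + large - 1) & ~(large - 1);      // round up to a page multiple
    bytevecM = (char*)VirtualAlloc(NULL, size,
        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
    // NULL here usually means the privilege is missing.
}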

If you must access the 4 GB of data in a somewhat random fashion, but you know the sequence of m7 values (your index) ahead of time, then you can pipeline the memory fetches explicitly ahead of your use (they need to start several hundred CPU cycles before the data is used to be effective). See the links below and the sketch after them:

  • _mm_prefetch
  • usage of "_mm_prefetch(...)"
  • MM_PREFETCH
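A minimal sketch of that pipelining, assuming the upcoming indices are available in an array (m7seq and n are hypothetical names):

#include <xmmintrin.h>    // _mm_prefetch, _MM_HINT_T0

// Sketch: while testing block i, prefetch both cache lines of block i+DIST.
// DIST needs tuning; the fetch must start a few hundred cycles ahead of use.
#define DIST 8
size_t i;
for (i = 0; i < n; i++) {
    if (i + DIST < n) {
        const char* next = &bytevecM[m7seq[i + DIST] & 0xffffff80];
        _mm_prefetch(next,      _MM_HINT_T0);      // first 64-byte line
        _mm_prefetch(next + 64, _MM_HINT_T0);      // second 64-byte line
    }
    // ... test the 128-byte block for m7seq[i] as before ...
}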

Here are some links that might be helpful in general on the subject of memory optimization:

  • What Every Programmer Should Know About Memory, by Ulrich Drepper:
    http://www.akkadia.org/drepper/cpumemory.pdf
  • Machine Architecture: Things Your Programming Language Never Told You, by Herb Sutter:
    http://www.gotw.ca/publications/concurrency-ddj.htm
    http://nwcpp.org/static/talks/2007/Machine_Architecture_-_NWCPP.pdf
    http://video.google.com/videoplay?docid=-4714369049736584770#

answered Oct 01 '22 by amdn


Sorry for posting this as an answer; I don't have enough reputation to comment.
What happens if you use the following as a test?

if( (psz[0]  | psz[1]  | psz[2]  | psz[3]  |
     psz[4]  | psz[5]  | psz[6]  | psz[7]  |
     psz[8]  | psz[9]  | psz[10] | psz[11] |
     psz[12] | psz[13] | psz[14] | psz[15] ) == 0) continue;

Unfortunately, I don't have a 64-bit system on which to compile it, and I am not familiar with exactly what the compiler does with C code, but it would seem to me that a single binary OR would be faster than the individual == comparisons. I also don't know the Intel intrinsics well, but it may be possible to optimize the above code in a similar manner to what you have already done; see the sketch below.
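For instance, here is how that OR idea might look with the SSE intrinsics already used in the question (a sketch only, untested): the eight ptest instructions collapse into a single test per block.

// Sketch: fold the eight 16-byte chunks together with OR, then test once.
__m128i acc01 = _mm_or_si128(psz[0], psz[1]);
__m128i acc23 = _mm_or_si128(psz[2], psz[3]);
__m128i acc45 = _mm_or_si128(psz[4], psz[5]);
__m128i acc67 = _mm_or_si128(psz[6], psz[7]);
__m128i acc   = _mm_or_si128(_mm_or_si128(acc01, acc23),
                             _mm_or_si128(acc45, acc67));
if (_mm_testz_si128(acc, acc)) continue;           // all 128 bytes are zero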
I hope my answer helps.
answered Oct 01 '22 by Mmarss


At 98% of 128-byte blocks being all zero, you're averaging less than one nonzero byte per 4K page. With an array that sparse, have you tried storing it as a sparse array? You'll save huge swathes of memory and the attendant cache-miss delays; I wouldn't be surprised if a plain std::map turns out faster.
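In C terms, even a sorted index of the nonzero blocks captures the idea (a sketch; the names are illustrative, and std::map would be the C++ equivalent suggested above):

#include <stdlib.h>

// Sketch: keep a sorted array of the indices of the nonzero 128-byte blocks
// (about 2% of them) and binary-search it instead of touching the 4 GB.
static int cmp_size_t(const void* a, const void* b)
{
    size_t x = *(const size_t*)a, y = *(const size_t*)b;
    return (x > y) - (x < y);
}

/* nonzero_idx: sorted block indices; n: how many. Returns 1 if block is zero. */
int block_is_zero(const size_t* nonzero_idx, size_t n, size_t block)
{
    return bsearch(&block, nonzero_idx, n, sizeof *nonzero_idx, cmp_size_t) == NULL;
}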

answered Oct 01 '22 by jthill