Prefetching Examples?

Tags: optimization, gcc, assembly, prefetch

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:

It is a simple, small, self-contained example.
Removing the __builtin_prefetch instruction results in performance degradation.
Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.

850

asked Sep 07 '11 01:09

Shaun Harker

1 Answer

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I could find that had a noticeable speedup from prefetching.) This code performs a very large data transpose.

This example uses the SSE prefetch intrinsic _mm_prefetch with the _MM_HINT_T0 hint (the prefetcht0 instruction), which may be the same instruction that GCC emits for __builtin_prefetch.

To run this example, you will need to compile it for x64 and have more than 4 GB of memory. You can run it with a smaller data size, but it will be too fast to time.
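As a rough check of the memory requirement, here is a minimal sketch of the buffer-size arithmetic. It assumes the sizes used in the listing (dimension = 4096, block = 16) and a 16-byte __m128d; the helper name transpose_bytes is hypothetical and is not part of the original code.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for a dimension x dimension matrix whose
// units are `block` vectors each, assuming 16 bytes per vector (sizeof(__m128d)).
constexpr std::uint64_t transpose_bytes(std::uint64_t dimension, std::uint64_t block){
    return block * dimension * dimension * 16;
}
```

With these assumptions, transpose_bytes(4096, 16) comes to 4294967296 bytes (4 GiB) for the buffer alone, and halving the dimension to 2048 brings it down to 1073741824 bytes (1 GiB).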

    #include <iostream>
    using std::cout;
    using std::endl;

    #include <emmintrin.h>
    #include <malloc.h>
    #include <stdlib.h>
    #include <time.h>
    #include <string.h>

    #define ENABLE_PREFETCH


    #define f_vector    __m128d
    #define i_ptr       size_t
    inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
        //  To be super-optimized later.

        f_vector *stop = A + L;

        do{
            f_vector tmpA = *A;
            f_vector tmpB = *B;
            *A++ = tmpB;
            *B++ = tmpA;
        }while (A < stop);
    }
    void transpose_even(f_vector *T,i_ptr block,i_ptr x){
        //  Transposes T.
        //  T contains x columns and x rows.
        //  Each unit is of size (block * sizeof(f_vector)) bytes.

        //Conditions:
        //  - 0 < block
        //  - 1 < x

        i_ptr row_size = block * x;
        i_ptr iter_size = row_size + block;

        //  End of entire matrix.
        f_vector *stop_T = T + row_size * x;
        f_vector *end = stop_T - row_size;

        //  Iterate each row.
        f_vector *y_iter = T;
        do{
            //  Iterate each column.
            f_vector *ptr_x = y_iter + block;
            f_vector *ptr_y = y_iter + row_size;

            do{

    #ifdef ENABLE_PREFETCH
                //  Prefetch the unit one row ahead of the one being swapped.
                _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
    #endif

                swap_block(ptr_x,ptr_y,block);

                ptr_x += block;
                ptr_y += row_size;
            }while (ptr_y < stop_T);

            y_iter += iter_size;
        }while (y_iter < end);
    }
    int main(){

        i_ptr dimension = 4096;
        i_ptr block = 16;

        i_ptr words = block * dimension * dimension;
        i_ptr bytes = words * sizeof(f_vector);

        cout << "bytes = " << bytes << endl;
    //    system("pause");

        f_vector *T = (f_vector*)_mm_malloc(bytes,16);
        if (T == NULL){
            cout << "Memory Allocation Failure" << endl;
            system("pause");
            exit(1);
        }
        memset(T,0,bytes);

        //  Perform in-place data transpose
        cout << "Starting Data Transpose...   ";
        clock_t start = clock();
        transpose_even(T,block,dimension);
        clock_t end = clock();

        cout << "Done" << endl;
        cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;

        _mm_free(T);
        system("pause");
    }

When I run it with ENABLE_PREFETCH enabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.725 seconds
    Press any key to continue . . .

When I run it with ENABLE_PREFETCH disabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.822 seconds
    Press any key to continue . . .

So there's a 13% speedup from prefetching.
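Since the question asks about GCC specifically, here is a minimal sketch of how the same hint might be written with __builtin_prefetch. As far as I know, on x86 GCC lowers __builtin_prefetch(p, 0, 3) (read access, highest locality) to prefetcht0, but I have not timed this variant. The names prefetch_t0 and sum_first_column are hypothetical and only illustrate the "prefetch one row ahead" pattern from the transpose loop.

```cpp
#include <cstddef>
#if !defined(__GNUC__)
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 (non-GCC compilers)
#endif

// Hypothetical wrapper: request a cache line for reading with the T0 hint.
inline void prefetch_t0(const void *p){
#if defined(__GNUC__)
    __builtin_prefetch(p, 0, 3);                 // rw = 0 (read), locality = 3
#else
    _mm_prefetch((const char*)p, _MM_HINT_T0);   // same hint via the SSE intrinsic
#endif
}

// Hypothetical demo: sums column 0 of a row-major matrix while prefetching the
// next row's first element, similar to prefetching `ptr_y + row_size`.
inline double sum_first_column(const double *m, std::size_t rows, std::size_t cols){
    double total = 0.0;
    for (std::size_t r = 0; r < rows; r++){
        const double *cur = m + r * cols;
        // On the last row this is the one-past-the-end pointer; a prefetch is
        // only a hint and does not fault on an address that is not mapped.
        prefetch_t0(cur + cols);
        total += *cur;
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing can change, and whether it helps depends on the access pattern and the hardware prefetcher.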

EDIT:

Here are some more results:

    Operating System: Windows 7 Professional/Ultimate
    Compiler: Visual Studio 2010 SP1
    Compile Mode: x64 Release

    Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
    Prefetch   : 0.868
    No Prefetch: 0.960

    Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
    Prefetch   : 0.725
    No Prefetch: 0.822

    Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
    Prefetch   : 0.718
    No Prefetch: 0.796

    2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
    Prefetch   : 2.273
    No Prefetch: 2.666

answered Sep 20 '22 01:09

Mysticial
