
Prefetching Examples?

Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:

  1. It is a simple, small, self-contained example.
  2. Removing the __builtin_prefetch instruction results in performance degradation.
  3. Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.
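To make criteria 2 and 3 concrete, the three variants being compared might look like the sketch below. It sums elements through an index array, a pattern where the next address is hard for hardware to predict. The names gather_sum and Mode and the lookahead distance are assumptions made for illustration; only __builtin_prefetch is a real GCC/Clang builtin. Whether the prefetch variant actually wins depends on the machine and the data size, so this shows only the shape of the comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; only __builtin_prefetch is a real builtin.
enum class Mode { None, Prefetch, Load };

// Sums data[idx[i]] for every i. `ahead` is how many iterations in advance
// the hint (or the plain load) is issued; 16 is a tuning guess, not a rule.
double gather_sum(const std::vector<double>& data,
                  const std::vector<std::size_t>& idx,
                  Mode mode, std::size_t ahead = 16) {
    double sum = 0.0;
    volatile double sink = 0.0;  // keeps the Mode::Load access from being optimized away
    const std::size_t n = idx.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + ahead < n) {
            assert(idx[i + ahead] < data.size());
            const double* future = &data[idx[i + ahead]];
            if (mode == Mode::Prefetch) {
                __builtin_prefetch(future, 0, 3);  // criterion 2: read hint, keep in cache
            } else if (mode == Mode::Load) {
                sink = *future;                    // criterion 3: a real memory access
            }
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

All three modes return the same sum; only the timing should differ, so each one would be timed separately on an array much larger than the last-level cache.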

Shaun Harker asked Sep 07 '11 01:09




1 Answer

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I could find that had a noticeable speedup from prefetching.) This code performs a very large data transpose.

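This example uses the SSE prefetch intrinsic _mm_prefetch with _MM_HINT_T0, which corresponds to the prefetcht0 instruction. On x86, GCC's __builtin_prefetch with its default arguments typically emits the same instruction.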

To run this example, you will need to compile it for x64 and have more than 4 GB of memory. You can run it with a smaller data size, but it will be too fast to time.

    #include <iostream>
    using std::cout;
    using std::endl;

    #include <emmintrin.h>
    #include <malloc.h>
    #include <stdlib.h>     //  system(), exit()
    #include <time.h>
    #include <string.h>

    #define ENABLE_PREFETCH


    #define f_vector    __m128d
    #define i_ptr       size_t
    inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
        //  To be super-optimized later.

        f_vector *stop = A + L;

        do{
            f_vector tmpA = *A;
            f_vector tmpB = *B;
            *A++ = tmpB;
            *B++ = tmpA;
        }while (A < stop);
    }
    void transpose_even(f_vector *T,i_ptr block,i_ptr x){
        //  Transposes T.
        //  T contains x columns and x rows.
        //  Each unit is of size (block * sizeof(f_vector)) bytes.

        //Conditions:
        //  - 0 < block
        //  - 1 < x

        i_ptr row_size = block * x;
        i_ptr iter_size = row_size + block;

        //  End of entire matrix.
        f_vector *stop_T = T + row_size * x;
        f_vector *end = stop_T - row_size;

        //  Iterate each row.
        f_vector *y_iter = T;
        do{
            //  Iterate each column.
            f_vector *ptr_x = y_iter + block;
            f_vector *ptr_y = y_iter + row_size;

            do{

    #ifdef ENABLE_PREFETCH
                //  Hint the block that the next iteration will swap.
                _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
    #endif

                swap_block(ptr_x,ptr_y,block);

                ptr_x += block;
                ptr_y += row_size;
            }while (ptr_y < stop_T);

            y_iter += iter_size;
        }while (y_iter < end);
    }
    int main(){

        i_ptr dimension = 4096;
        i_ptr block = 16;

        i_ptr words = block * dimension * dimension;
        i_ptr bytes = words * sizeof(f_vector);

        cout << "bytes = " << bytes << endl;
    //    system("pause");

        f_vector *T = (f_vector*)_mm_malloc(bytes,16);
        if (T == NULL){
            cout << "Memory Allocation Failure" << endl;
            system("pause");
            exit(1);
        }
        memset(T,0,bytes);

        //  Perform in-place data transpose
        cout << "Starting Data Transpose...   ";
        clock_t start = clock();
        transpose_even(T,block,dimension);
        clock_t end = clock();

        cout << "Done" << endl;
        cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;

        _mm_free(T);
        system("pause");
    }
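For GCC or Clang, the _mm_prefetch call in the listing above has a close counterpart in __builtin_prefetch. The sketch below wraps both spellings in one macro and uses it in a strided column walk, which is the same access pattern as ptr_y in transpose_even. The names PREFETCH_T0 and sum_column are hypothetical and not part of the original project. The example assumes that locality 3 maps to prefetcht0 on x86, which is the usual behavior but is ultimately up to the compiler.

```cpp
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
// GCC/Clang builtin: (address, rw = 0 for read, locality = 3 for high temporal locality).
#define PREFETCH_T0(addr) __builtin_prefetch((addr), 0, 3)
#else
#include <xmmintrin.h>
// Same intrinsic as in the listing above.
#define PREFETCH_T0(addr) _mm_prefetch((const char*)(addr), _MM_HINT_T0)
#endif

// Hypothetical stand-in for the inner loop of transpose_even: walks one column
// of a row-major matrix with stride `row_size`, hinting the element that the
// next iteration will read.
double sum_column(const double* base, std::size_t rows, std::size_t row_size) {
    double sum = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        if (r + 1 < rows) {
            PREFETCH_T0(&base[(r + 1) * row_size]);  // one row ahead, guarded to stay in bounds
        }
        sum += base[r * row_size];
    }
    return sum;
}
```

The guard on the last row keeps the pointer arithmetic inside the array. The prefetch itself is only a hint and does not change the result.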

When I run it with ENABLE_PREFETCH enabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.725 seconds
    Press any key to continue . . .

When I run it with ENABLE_PREFETCH disabled, this is the output:

    bytes = 4294967296
    Starting Data Transpose...   Done
    Time: 0.822 seconds
    Press any key to continue . . .

So there's a 13% speedup from prefetching.

EDIT:

Here are some more results (times in seconds):

    Operating System: Windows 7 Professional/Ultimate
    Compiler: Visual Studio 2010 SP1
    Compile Mode: x64 Release

    Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
    Prefetch   : 0.868
    No Prefetch: 0.960

    Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
    Prefetch   : 0.725
    No Prefetch: 0.822

    Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
    Prefetch   : 0.718
    No Prefetch: 0.796

    2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
    Prefetch   : 2.273
    No Prefetch: 2.666
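The relative gain on each machine can be worked out the same way as the 13% figure, as the no-prefetch time divided by the prefetch time, minus one. A minimal sketch of that arithmetic follows. The timings are copied from the results above, while the names Result, kResults, speedup_percent and print_speedups are made up for illustration.

```cpp
#include <cassert>
#include <cstdio>

// Timings in seconds, copied from the results above.
struct Result {
    const char* machine;
    double prefetch_s;
    double no_prefetch_s;
};

static const Result kResults[] = {
    {"Core i7 860 @ 2.8 GHz",    0.868, 0.960},
    {"Core i7 920 @ 3.5 GHz",    0.725, 0.822},
    {"Core i7 2600K @ 4.6 GHz",  0.718, 0.796},
    {"2 x Xeon X5482 @ 3.2 GHz", 2.273, 2.666},
};

// Speedup of the prefetch build over the no-prefetch build, in percent.
double speedup_percent(double prefetch_s, double no_prefetch_s) {
    assert(prefetch_s > 0.0);
    return (no_prefetch_s / prefetch_s - 1.0) * 100.0;
}

// Prints one line per machine with its speedup rounded to one decimal place.
void print_speedups() {
    for (const Result& r : kResults) {
        std::printf("%-26s %.1f%%\n", r.machine,
                    speedup_percent(r.prefetch_s, r.no_prefetch_s));
    }
}
```

Calling print_speedups() gives gains of roughly 10 to 17 percent across the four machines. The 13% quoted earlier corresponds to the i7 920.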
Mysticial answered Sep 20 '22 01:09