How to defeat hardware prefetcher in core i3/i7 in linux

Tags:

I am trying to find a way to defeat the H/w prefetcher to detect the stream pattern and access 4KB data in a random order so that it is not detected and prefetched by H/w prefetcher.

Initially I was thinking to access all even index data in a random pattern as H/w prefetcher prefetch the next cache lines always (so when I access even index, next odd index data is already prefetched).

I wrote the code to access all even index data in a random pattern, however the results indicate that the prefetcher detected the pattern (don't know how ? There is no fixed stride, all are random stride )

I was investigating the reason-why this happened, then I found this article in Intel ; https://software.intel.com/en-us/forums/topic/473493

According to John D. McCalpin, PhD, "Dr. Bandwidth,

In section 2.2.5.4 of "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-028, July 2013),it states that,

streamer prefetcher "[d]etects and maintains up to 32 streams of data accesses. For each 4K byte page, you can maintain one forward and one backward stream can be maintained.

This implies that the L2 hardware prefetcher tracks the 16 4KiB pages most recently accessed and remembers enough of the access patterns for those pages to track one forward stream and one backward stream. So to defeat the L2 streamer prefetcher with "random" fetches, simply ensure that you access more than 15 other 4 KiB pages before you make a second reference to a previously referenced page. So a "random" sequences of fetches might be composed of a random permutation of more than 16 4 KiB page numbers with a random offset within each page. (I typically use at least 32 pages in my permutation list.)

So it means in between accesses of two different random indexes of same 4KB pages we need to access atleast 16 4KB pages to defeat H/w prefetcher.

I have implemented the concept suggested by John D. McCalpin , however the results again show the h/w prefetcher is not defeated. It is able to detect some pattern and prefetch data (see sample output) . I have varied number of accessed pages from 20-40 4KB pages , but no improvement/change in result.

Here is my code :

#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define MAX_COUNT 3000
#define INDEX (40*1024) // size of DUMMY 40 4KB pages

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

unsigned long probe(char *adrs) {
  volatile unsigned long time;
  asm __volatile__ (
    " mfence              \n"
    " lfence              \n"
    " rdtsc               \n"
    " lfence              \n"
    " movl %%eax, %%esi \n"
    " movl (%1), %%eax     \n"
    " lfence              \n"
    " rdtsc               \n"
    " subl %%esi, %%eax \n"
    " clflush 0(%1)       \n"
    : "=a" (time)
    : "c" (adrs)
    : "%esi", "%edx");
  return time;
}

void shuffle(int *arr, size_t n)
{
    if (n > 1) 
    {
        size_t i;
        srand(time(NULL));
        for (i = 0; i < n - 1; i++) 
        {
          size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
          int t = arr[j];
          arr[j] = arr[i];
          arr[i] = t;
        }
    }
}


static const int DATA[1024]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023};

int main(int argc, char *argv[])
{

    int counter=0,k=0;
    unsigned long Access_Time[MAX_COUNT][64]={0};   
    int DUMMY[INDEX];// dummy array of 40 * 4KB ;  

    //Initialize
    for(k=0;k<INDEX;k++)
        DUMMY[k]=k;

    //access it to check segmentation fault is happening or not
    for(k=0;k<INDEX;k++)
        DUMMY[k]+=k;

    // even index in random order
    int index[32]={4,8,16,32,54,34,62,50,26,52,30,60,46,18,36,58,42,10,20,40,6,12,24,48,22,44,14,28,56,38,2,0};

    int TOTAL_RANDOM_PAGE=40;

    int i,PAGE[TOTAL_RANDOM_PAGE]; // PAGE will contain page no of 40 pages which will be accessed in random order to defeat prefetcher
        for (i=0; i<TOTAL_RANDOM_PAGE; i++)
    {
            PAGE[i] = i;
        }

    shuffle(PAGE, TOTAL_RANDOM_PAGE); // PAGE now have page no in random order

    FILE *fp2;
    int s,s1;
    int random_index=0,sum=0;

    const int *p0=&DATA[0];
    for (s=0;s<64;s++)
    {
        clflush((void *)(p0+s*16));
    }

    while(counter<MAX_COUNT)
    {               
        // Find Access time for Even Index
        for (s=0;s<32;s++)
        {

            // Access a random index
                Access_Time[counter][index[s]]=probe((char *)(p0+16*index[s]));

            //Now, access 40 other indexes belong to other 40 4KB page      
            shuffle(PAGE, TOTAL_RANDOM_PAGE); // random orderpage
            for(random_index=0;random_index<TOTAL_RANDOM_PAGE;random_index++)
            {
            DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]]=2*DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]];
            }

        }// end of for loop     

        // Flush all DATA from cache        
        for (s1=0;s1<64;s1++)
        {
            clflush((void *)(p0+s1*16));
        }
     counter++;

    }// end of while loop

    fp2=fopen("All_access_time.txt","a");

    int index4;
    for(counter=0;counter<MAX_COUNT;counter++)
    {
        for (index4=0;index4<64;index4++)
        {
            if(Access_Time[counter][index4]>0 && Access_Time[counter][index4]<200)
            fprintf(fp2,"%d,%d,%lu\n",counter,index4,Access_Time[counter][index4]);             
        }
    }

return 1;
}

Another interesting observation is , the access time of random indexes which were prefetched has access time around 35-70 ticks. (see sample output)

In my system, the L1 access time 36-44 ticks, L2 access time 50-70 ticks, L3 access time = 90-120 ticks.

Experiments were done on both Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz and Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, however results are similar.

Few internal details of system,

L1-D = 32KB,  ways_of_associative=8
L1-I = 32KB,  ways_of_associative=8
L2   = 256KB, ways_of_associative=8
L3   = 3072KB (Core-i3), ways_of_associative=12
L3   = 8192KB (Core-i7), ways_of_associative=16
Cache line size=64Bytes

Can you please help me to understand why H/W prefetcher able to detect my random pattern ? Where am I making mistakes?

How to do the coding so that I can defeat the prefetcher and h/w prefetcher unable to prefetch my data ?

NOTE: I have disabled s/w prefetcher optimization while compiling using -O0 option with gcc.

sample output :

(counter,index,access_time)
30,8,56
30,18,72
30,20,52
30,28,72
30,34,72
30,36,72
30,38,72
30,40,72
30,42,72
31,8,52
31,18,56
31,20,52
31,28,72
31,34,52
31,36,72
31,38,56
31,40,72
31,42,52
31,60,56
32,8,52
32,18,72
32,20,52
32,28,52
32,34,72
32,36,52
32,38,72
32,40,52
32,42,52
32,48,52
33,8,56
33,18,72
33,20,52
33,28,72
33,34,52
33,36,72
33,38,72
33,40,52
33,42,72
34,8,72
34,18,52
34,20,72
34,28,72
34,34,72
34,36,52
34,38,76
34,40,72
34,42,76
34,60,72

526

asked Aug 10 '15 22:08

bholanath

1 Answers

If you are brave enough to write a kernel module you can do what you want.

As almost all features of the Core CPUs the hardware prefetching logic can be disabled for debugging purposes.

Hardware prefetching is controlled by the Model Specific Register IA32_MISC_ENABLE (0x1a0). Just set bit 9 of this register, and the prefetcher goes off.

For more information please check the "Intel® 64 and IA-32 Architectures Software Developer’s Manual". A search for IA32_MISC_ENABLE will bring you to the correct chapter.

Also a search on the Linux kernel source for the same keyword gives a few hits. They aren't related to prefetching but for a different thing, but the code looks like a good boilerplate as it shows how to read and write the IA32_MISC_ENABLE register from the kernel.

If you go this way, double and triple check what you're doing. You don't want to accidently disable the thermal monitors. They are located in MISC_ENABLE as well :-)

128

answered Sep 22 '22 16:09

Nils Pipenbrinck

Related questions
                            
                                Implementing cancellable syscalls in userspace
                            
                                efficient 2d mean filter implementation that minimises redundant memory loads?
                            
                                Where to find a comparison of Flex/Bison, Ragel, ANTLR and others?
                            
                                How to measure ARM performance?
                            
                                Floating Point Exception Core Dump
                            
                                Why i am not getting the expected output in the following c programme? [duplicate]
                            
                                Adding self-signed SSL certificate for libcurl
                            
                                Writing to multiple file-descriptors
                            
                                libuv: how to gracefully exit application on an error?
                            
                                Why do system calls return EFAULT instead of sending a segfault?
                            
                                How thread-safe is V4L2?
                            
                                How to put 2 sections in 1 segment (Using ld scripts)
                            
                                Boomerang: Unable to load libQtGUI
                            
                                One-instance only application in C and Linux [duplicate]
                            
                                Good sentinel value for double if prefer to use -ffast-math
                            
                                CMake - Automatically Parsing Dependencies of Precompiled Header?
                            
                                how to print float value upto 2 decimal place without rounding off [duplicate]
                            
                                The concept of a type name scope in C11
                            
                                How to create stub shared libraries on Linux
                            
                                How to use strerror_l with current locale?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to defeat hardware prefetcher in core i3/i7 in linux

Tags:

c

linux

optimization

assembly

bholanath

People also ask

1 Answers

Nils Pipenbrinck

Recent Activity

Donate For Us