Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to defeat hardware prefetcher in core i3/i7 in linux

I am trying to find a way to defeat the H/w prefetcher to detect the stream pattern and access 4KB data in a random order so that it is not detected and prefetched by H/w prefetcher.

Initially I was thinking to access all even index data in a random pattern as H/w prefetcher prefetch the next cache lines always (so when I access even index, next odd index data is already prefetched).

I wrote the code to access all even index data in a random pattern, however the results indicate that the prefetcher detected the pattern (don't know how ? There is no fixed stride, all are random stride )

I was investigating the reason-why this happened, then I found this article in Intel ; https://software.intel.com/en-us/forums/topic/473493

According to John D. McCalpin, PhD, "Dr. Bandwidth,

In section 2.2.5.4 of "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-028, July 2013),it states that,

streamer prefetcher "[d]etects and maintains up to 32 streams of data accesses. For each 4K byte page, you can maintain one forward and one backward stream can be maintained.

This implies that the L2 hardware prefetcher tracks the 16 4KiB pages most recently accessed and remembers enough of the access patterns for those pages to track one forward stream and one backward stream. So to defeat the L2 streamer prefetcher with "random" fetches, simply ensure that you access more than 15 other 4 KiB pages before you make a second reference to a previously referenced page. So a "random" sequences of fetches might be composed of a random permutation of more than 16 4 KiB page numbers with a random offset within each page. (I typically use at least 32 pages in my permutation list.)

So it means in between accesses of two different random indexes of same 4KB pages we need to access atleast 16 4KB pages to defeat H/w prefetcher.

I have implemented the concept suggested by John D. McCalpin , however the results again show the h/w prefetcher is not defeated. It is able to detect some pattern and prefetch data (see sample output) . I have varied number of accessed pages from 20-40 4KB pages , but no improvement/change in result.

Here is my code :

#define _GNU_SOURCE             /* See feature_test_macros(7) */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sched.h>

#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define MAX_COUNT 3000
#define INDEX (40*1024) // size of DUMMY 40 4KB pages

inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

unsigned long probe(char *adrs) {
  volatile unsigned long time;
  asm __volatile__ (
    " mfence              \n"
    " lfence              \n"
    " rdtsc               \n"
    " lfence              \n"
    " movl %%eax, %%esi \n"
    " movl (%1), %%eax     \n"
    " lfence              \n"
    " rdtsc               \n"
    " subl %%esi, %%eax \n"
    " clflush 0(%1)       \n"
    : "=a" (time)
    : "c" (adrs)
    : "%esi", "%edx");
  return time;
}

void shuffle(int *arr, size_t n)
{
    if (n > 1) 
    {
        size_t i;
        srand(time(NULL));
        for (i = 0; i < n - 1; i++) 
        {
          size_t j = i + rand() / (RAND_MAX / (n - i) + 1);
          int t = arr[j];
          arr[j] = arr[i];
          arr[i] = t;
        }
    }
}


static const int DATA[1024]={0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012,1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023};

int main(int argc, char *argv[])
{

    int counter=0,k=0;
    unsigned long Access_Time[MAX_COUNT][64]={0};   
    int DUMMY[INDEX];// dummy array of 40 * 4KB ;  

    //Initialize
    for(k=0;k<INDEX;k++)
        DUMMY[k]=k;

    //access it to check segmentation fault is happening or not
    for(k=0;k<INDEX;k++)
        DUMMY[k]+=k;

    // even index in random order
    int index[32]={4,8,16,32,54,34,62,50,26,52,30,60,46,18,36,58,42,10,20,40,6,12,24,48,22,44,14,28,56,38,2,0};

    int TOTAL_RANDOM_PAGE=40;

    int i,PAGE[TOTAL_RANDOM_PAGE]; // PAGE will contain page no of 40 pages which will be accessed in random order to defeat prefetcher
        for (i=0; i<TOTAL_RANDOM_PAGE; i++)
    {
            PAGE[i] = i;
        }

    shuffle(PAGE, TOTAL_RANDOM_PAGE); // PAGE now have page no in random order

    FILE *fp2;
    int s,s1;
    int random_index=0,sum=0;

    const int *p0=&DATA[0];
    for (s=0;s<64;s++)
    {
        clflush((void *)(p0+s*16));
    }

    while(counter<MAX_COUNT)
    {               
        // Find Access time for Even Index
        for (s=0;s<32;s++)
        {

            // Access a random index
                Access_Time[counter][index[s]]=probe((char *)(p0+16*index[s]));

            //Now, access 40 other indexes belong to other 40 4KB page      
            shuffle(PAGE, TOTAL_RANDOM_PAGE); // random orderpage
            for(random_index=0;random_index<TOTAL_RANDOM_PAGE;random_index++)
            {
            DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]]=2*DUMMY[1024*PAGE[random_index]+16*PAGE[random_index]];
            }

        }// end of for loop     

        // Flush all DATA from cache        
        for (s1=0;s1<64;s1++)
        {
            clflush((void *)(p0+s1*16));
        }
     counter++;

    }// end of while loop

    fp2=fopen("All_access_time.txt","a");

    int index4;
    for(counter=0;counter<MAX_COUNT;counter++)
    {
        for (index4=0;index4<64;index4++)
        {
            if(Access_Time[counter][index4]>0 && Access_Time[counter][index4]<200)
            fprintf(fp2,"%d,%d,%lu\n",counter,index4,Access_Time[counter][index4]);             
        }
    }

return 1;
}

Another interesting observation is , the access time of random indexes which were prefetched has access time around 35-70 ticks. (see sample output)

In my system, the L1 access time 36-44 ticks, L2 access time 50-70 ticks, L3 access time = 90-120 ticks.

Experiments were done on both Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz and Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, however results are similar.

Few internal details of system,

L1-D = 32KB,  ways_of_associative=8
L1-I = 32KB,  ways_of_associative=8
L2   = 256KB, ways_of_associative=8
L3   = 3072KB (Core-i3), ways_of_associative=12
L3   = 8192KB (Core-i7), ways_of_associative=16
Cache line size=64Bytes

Can you please help me to understand why H/W prefetcher able to detect my random pattern ? Where am I making mistakes?

How to do the coding so that I can defeat the prefetcher and h/w prefetcher unable to prefetch my data ?

NOTE: I have disabled s/w prefetcher optimization while compiling using -O0 option with gcc.

sample output :

(counter,index,access_time)
30,8,56
30,18,72
30,20,52
30,28,72
30,34,72
30,36,72
30,38,72
30,40,72
30,42,72
31,8,52
31,18,56
31,20,52
31,28,72
31,34,52
31,36,72
31,38,56
31,40,72
31,42,52
31,60,56
32,8,52
32,18,72
32,20,52
32,28,52
32,34,72
32,36,52
32,38,72
32,40,52
32,42,52
32,48,52
33,8,56
33,18,72
33,20,52
33,28,72
33,34,52
33,36,72
33,38,72
33,40,52
33,42,72
34,8,72
34,18,52
34,20,72
34,28,72
34,34,72
34,36,52
34,38,76
34,40,72
34,42,76
34,60,72
like image 526
bholanath Avatar asked Aug 10 '15 22:08

bholanath


People also ask

Should I disable hardware prefetcher?

Hardware prefetching is typically enabled by default. If you are developing a pattern type or hypervisor image that provides WebSphere runtimes but does not extend the WebSphere hypervisor or Web Application pattern, you might choose to also disable hardware prefetching to see similar performance improvements.

Should I disable hardware prefetcher in BIOS?

Hardware prefetchers work well in workloads that traverse array and other regular data structures. The hardware prefetcher options are disabled by default and should be disabled when running applications that perform aggressive software prefetching or for workloads with limited cache.


1 Answers

If you are brave enough to write a kernel module you can do what you want.

As almost all features of the Core CPUs the hardware prefetching logic can be disabled for debugging purposes.

Hardware prefetching is controlled by the Model Specific Register IA32_MISC_ENABLE (0x1a0). Just set bit 9 of this register, and the prefetcher goes off.

For more information please check the "Intel® 64 and IA-32 Architectures Software Developer’s Manual". A search for IA32_MISC_ENABLE will bring you to the correct chapter.

Also a search on the Linux kernel source for the same keyword gives a few hits. They aren't related to prefetching but for a different thing, but the code looks like a good boilerplate as it shows how to read and write the IA32_MISC_ENABLE register from the kernel.

If you go this way, double and triple check what you're doing. You don't want to accidently disable the thermal monitors. They are located in MISC_ENABLE as well :-)

like image 128
Nils Pipenbrinck Avatar answered Sep 22 '22 16:09

Nils Pipenbrinck