 

Measurement of TLB effects on a Cortex-A9

After reading "What every programmer should know about memory" (https://people.freebsd.org/~lstewart/articles/cpumemory.pdf), I wanted to try one of the author's tests, namely measuring the effect of the TLB on execution time.

I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.

According to the documentation:

  • there are two micro TLBs at the L1 level, one for instructions and one for data (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)

  • the main TLB is located at the L2 level (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)

  • the data micro TLB has 32 entries (the instruction micro TLB has either 32 or 64 entries)

  • L1 cache size == 32 KB
  • L1 cache line size == 32 bytes
  • L2 cache size == 1 MB
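
From those numbers alone one can estimate a few thresholds. A small sketch of the arithmetic (mine, not from the documentation; the 4KB page size is an assumption based on the default Linux configuration):

#include <stdio.h>

int main(void)
{
    const unsigned page_size    = 4096;      /* assumed: default Linux page size */
    const unsigned line_size    = 32;        /* L1 cache line, per the TRM */
    const unsigned l1_size      = 32 * 1024; /* L1 data cache */
    const unsigned utlb_entries = 32;        /* data micro TLB entries */

    printf("cache lines in L1:          %u\n", l1_size / line_size);               /* 1024 */
    printf("pages covered by micro TLB: %u\n", utlb_entries);                      /* 32 */
    printf("micro TLB reach:            %u KB\n", utlb_entries * page_size / 1024); /* 128 */
    return 0;
}

With one entry mmap'd per page (as in the code below), the data micro TLB should therefore stop covering the whole working set somewhere past 32 entries.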

I wrote a small program that allocates an array of structs with N entries. Each entry is 32 bytes, so it fits exactly in one cache line. I perform several read accesses and measure the execution time.

typedef struct {
     int elmt; // sizeof(int) == 4 bytes
     char padding[28]; // 4 + 28 = 32B == cache line size
}entry;
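
// A C11 sanity check one could add here (my addition, not in the
// original code): confirm the padding really yields a struct of
// exactly one cache line.
_Static_assert(sizeof(entry) == 32, "entry must be exactly one cache line");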


volatile entry ** entries = NULL;

//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);

for(i = 0; i < NB_ENTRIES; i++)
{
      entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}

entries[LAST_ELEMENT]->elmt = -1

//Randomly access and init with random values
n = -1;
i = 0;
while(++n < NB_ENTRIES -1)
{
      //init with random value
      entries[i]->elmt = rand() % NB_ENTRIES;

      //loop till we reach the last element
      while(entries[entries[i]->elmt]->elmt != -1)
      {
            entries[i]->elmt++;
            if(entries[i]->elmt == NB_ENTRIES)
                     entries[i]->elmt = 0;
      }

       i = entries[i]->elmt;
}


gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
     j = 0;
     while(j != -1)
     {
          j = entries[j]->elmt
     }
}
gettimeofday(&tEnd, NULL);

time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);

fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);
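
As an aside, clock_gettime(CLOCK_MONOTONIC) is often preferred over gettimeofday for this kind of measurement, since it cannot jump when the wall clock is adjusted. A minimal sketch of the same timing with it (my variant, not what the code above uses):

#include <time.h>

struct timespec tsStart, tsEnd;

clock_gettime(CLOCK_MONOTONIC, &tsStart);
/* ... pointer-chasing loop as above ... */
clock_gettime(CLOCK_MONOTONIC, &tsEnd);

long long elapsed_ns = (tsEnd.tv_sec - tsStart.tv_sec) * 1000000000LL
                     + (tsEnd.tv_nsec - tsStart.tv_nsec);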

I have an outer loop that makes NB_ENTRIES vary from 4 to 1024.

As one can see in the figure below, when NB_ENTRIES == 256, the execution time gets longer.

When NB_ENTRIES == 404 I get an "out of memory" error (why? Micro TLBs exceeded? Main TLB exceeded? Page tables exceeded? Virtual memory for the process exceeded?)

Can someone please explain what is really going on from 4 to 256 entries, and then from 257 to 404 entries?

[Figure: Effects of TLB on execution time]

EDIT 1

As it has been suggested, I ran membench (src code); the results are below:

[Figure: membench results]

EDIT 2

In the following paper (page 3) they run (I suppose) the same benchmark, but the different steps are clearly visible in their plots, which is not the case in mine.

[Figure: latency plots from the referenced paper]

Right now, according to their results and explanations, I can only identify a few things.

  • The plots confirm that the L1 cache line size is 32 bytes because, as they said,

"once the array size exceeds the size of the data cache (32KB), the reads begin to generate misses [...] an inflection point occurs when every read generates a miss".

In my case the very first inflection point appears at stride == 32 bytes.

  • The graph shows that there is a second-level (L2) cache; I think it is depicted by the yellow line (1MB == L2 size).

  • Therefore the last two plots above the latter probably reflect the latency of accessing main memory (+ TLB?).
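
For context, the kernel such a benchmark runs boils down to something like this (a sketch of mine, not membench's actual source; all names are invented):

#include <stddef.h>

/* Read one byte every `stride` bytes across `size` bytes of `buf`,
 * `reps` times. Per-access latency, plotted against (size, stride),
 * exposes the line size, the cache sizes and the TLB reach. */
unsigned char walk(const unsigned char *buf, size_t size,
                   size_t stride, long reps)
{
    unsigned char sink = 0;
    for (long r = 0; r < reps; r++)
        for (size_t off = 0; off < size; off += stride)
            sink ^= buf[off]; /* XOR keeps the reads from being optimized out */
    return sink;
}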

However, from this benchmark I am not able to identify:

  • the cache associativity. Normally the D-cache and I-cache are 4-way set associative (Cortex-A9 TRM).
  • The TLB effects. As they said,

"in most systems, a secondary increase in latency is indicative of the TLB, which caches a limited number of virtual to physical translations. [...] The absence of a rise in latency attributable to TLB indicates that [...]"

large page sizes have probably been used/implemented.

EDIT 3

This link explains the TLB effects using another membench graph. One can actually see the same effects in my graph.

On a 4KB page system, as you grow your strides, while they're still < 4K, you'll enjoy less and less utilization of each page [...] you'll have to access the 2nd level TLB on each access [...]

The Cortex-A9 supports a 4KB page mode. Indeed, as one can see in my graph, latencies increase up to a stride of 4K; then, when the stride reaches 4K,

you suddenly start benefiting again since you're actually skipping whole pages.
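
That effect can be isolated directly: touch exactly one word per 4KB page and grow the number of pages past the 32 data micro TLB entries. A sketch (mine, not from the linked article):

#include <stddef.h>

/* Touch one int per 4KB page across `npages` pages of `buf`. Once
 * npages exceeds the 32 data micro TLB entries, every access should
 * need a main-TLB lookup (or a page-table walk) for its translation. */
long touch_pages(const int *buf, size_t npages)
{
    long sum = 0;
    for (size_t p = 0; p < npages; p++)
        sum += buf[p * (4096 / sizeof(int))];
    return sum;
}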

asked May 29 '15 by D4l3k

1 Answer

tl;dr -> Provide a proper MVCE.

This answer should be a comment but is too big to be posted as a comment, so I'm posting it as an answer instead:

  1. I had to fix a bunch of syntax errors (missing semicolons) and declare undefined variables.

  2. After fixing all those problems, the code did NOTHING (the program quit even before executing the first mmap). My tip: use curly brackets all the time. Here are your first and second errors, both caused by NOT doing so:


// after calloc:
if(entries == NULL) perror("calloc failed"); exit(1);
// after mmap
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);

both lines just terminate your program regardless of the condition.
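
With braces, the checks behave as intended:

// after calloc:
if (entries == NULL) {
    perror("calloc failed");
    exit(1);
}
// after mmap
if (entries[i] == MAP_FAILED) {
    perror("mmap failed");
    exit(1);
}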

  3. Here you have an endless loop (reformatted, added curly brackets but no other change):


//Randomly access and init with random values
n = -1;
i = 0;
while (++n < NB_ENTRIES -1) {
    //init with random value
    entries[i]->elmt = rand() % NB_ENTRIES;

    //loop till we reach the last element
    while (entries[entries[i]->elmt]->elmt != -1) {
        entries[i]->elmt++;
        if (entries[i]->elmt == NB_ENTRIES) {
            entries[i]->elmt = 0;
        }
    }

    i = entries[i]->elmt;
}

The first iteration starts by setting entries[0]->elmt to some random value, then the inner loop increments it until it reaches LAST_ELEMENT. Then i is set to that value (i.e. LAST_ELEMENT), and the second iteration overwrites the end marker -1 with some other random value. From then on, it is constantly incremented mod NB_ENTRIES in the inner loop until you hit CTRL+C.
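
One way to build the intended random chain without that trap is to shuffle a permutation first and only then link the entries. A sketch (mine, not the original code):

// Keep index 0 as the start, shuffle the remaining indices
// (Fisher-Yates), then chain each entry to the next and terminate the
// last one with -1. Chasing elmt from index 0 then visits every entry
// exactly once.
int *perm = malloc(NB_ENTRIES * sizeof *perm);
for (int k = 0; k < NB_ENTRIES; k++)
    perm[k] = k;
for (int k = NB_ENTRIES - 1; k > 1; k--) {
    int r = 1 + rand() % k;   // pick a swap partner in [1, k]; index 0 stays put
    int tmp = perm[k]; perm[k] = perm[r]; perm[r] = tmp;
}
for (int k = 0; k < NB_ENTRIES - 1; k++)
    entries[perm[k]]->elmt = perm[k + 1];
entries[perm[NB_ENTRIES - 1]]->elmt = -1;
free(perm);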

Conclusion

If you want help, then post a Minimal, Complete, and Verifiable example and not something else.

answered Oct 23 '22 by Bodo Thiesen