Why do memory access times increase when far over CPU cache sizes

While looking at performance issues involving large numbers of accesses outside of CPU cache sizes, I made a test which "randomly" times memory accesses in increasing block sizes. I see the expected steps at the L1, L2 and L3 cache sizes, but was surprised to see access times continue to increase far beyond the cache capacity.

For instance, the access rate halved going from thrashing a 256 MB block to a 4 GB block: from 50 read/writes per µs down to 25 read/writes per µs. The decline continues up to the system memory limit. I left 8 GB (or 4 GB) free for other apps and the OS.

The L3 cache is 8 MB, so I would have expected very little cache influence at the larger block sizes.

The algorithm uses primitive polynomials to "randomly" address each 64-bit word. This accesses addresses in a fairly random order but ensures that every address, except index 0, is accessed exactly once per pass. After enough passes that each block size takes a second or so, the results are tabulated.
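To make the addressing scheme concrete, here is a stripped-down sketch for N = 4 (not the benchmark itself, just the index generator), using the first polynomial from the gf[] table in the full code below:

// Galois LFSR sketch: the primitive polynomial 0x13 (x^4 + x + 1) steps the
// index through every nonzero 4-bit value exactly once before returning to 1.
#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t poly = 0x13;     // first entry of the gf[] table (N = 4)
    const uint32_t size = 1u << 4;  // 16-word block; index 0 is never visited
    uint32_t index = 1;
    do {
        printf("%u ", index);       // the benchmark does ++memory[index] here
        index <<= 1;                // shift left ...
        if (index & size)           // ... and fold the overflow bit back in
            index ^= poly;
    } while (index != 1);
    printf("\n");                   // 15 distinct indices: 1 2 4 8 3 6 12 ...
    return 0;
}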

I'm at a loss to explain this continued slowdown far beyond the cache limits. Any explanations?

Here are the results from 3 different Windows 10 machines:

(Each column pair: memory block size in bytes, 64-bit words incremented per µs)
-- desktop i7 980 24GB --     -- Surface Book 16GB --      -- HP Envy 8GB --
      128    544.80              128    948.43              128    774.22
      256    554.01              256   1034.15              256    715.50
      512    560.12              512    993.28              512    665.23
    1.02k    512.93            1.02k    944.24            1.02k    665.19
    2.05k    527.47            2.05k    947.09            2.05k    664.84
    4.10k    517.41            4.10k    931.48            4.10k    664.94
    8.19k    517.55            8.19k    939.61            8.19k    666.40
   16.38k    518.30           16.38k    941.18           16.38k    666.88
   32.77k    518.10           32.77k    938.77           32.77k    663.33
   65.54k    505.93           65.54k    889.42           65.54k    645.61
  131.07k    501.91          131.07k    855.01          131.07k    577.49
  262.14k    495.61          262.14k    882.75          262.14k    507.57
  524.29k    356.98          524.29k    774.23          524.29k    445.47
    1.05m    281.87            1.05m    695.35            1.05m    417.13
    2.10m    240.41            2.10m    650.26            2.10m    366.45
    4.19m    210.10            4.19m    229.06            4.19m    129.21
    8.39m    158.72            8.39m    114.95            8.39m     77.27
   16.78m     99.08           16.78m     84.95           16.78m     62.47
   33.55m     79.12           33.55m     60.14           33.55m     54.94
   67.11m     68.22           67.11m     34.56           67.11m     49.89
  134.22m     56.17          134.22m     22.52          134.22m     39.66
  268.44m     50.03          268.44m     23.81          268.44m     35.16
  536.87m     46.24          536.87m     39.66          536.87m     32.50
 1073.74m     43.29         1073.74m     30.33         1073.74m     25.28
 2147.48m     33.33         2147.48m     25.19         2147.48m     15.94
 4294.97m     24.85         4294.97m     10.83         4294.97m     13.18
 8589.93m     19.96         8589.93m      9.61
17179.87m     17.05

Here's the C++ code:

// Memory access times for randomly distributed read/writes

#include <iostream>
#include <cstdio>
#include <algorithm>
#include <chrono>
#include <array>

using namespace std;

// primitive polynomials over GF(2^N)
// these form simple shift registers that cycle through all possible numbers in 2^N except for 0
const array<uint32_t, 28> gf = {
    0x13, 0x25, 0x67, 0xcb,                        0x1cf, 0x233, 0x64f, 0xbb7,
    0x130f, 0x357f, 0x4f9f, 0x9e47,                0x11b2b, 0x2df4f, 0x472f3, 0xdf6af,
    0x16b04f, 0x2e0fd5, 0x611fa7, 0xa81be1,        0x11f21c7, 0x202d219, 0x67833df, 0xbc08c6b,
    0x123b83c7, 0x2dbf7ea3, 0x6268545f, 0xe6fc6257
};

int main()
{
    typedef uint64_t TestType;
    printf("        | Memory block (bytes)\n        |         | %d bit words incremented per us\n", 8 * (int)sizeof(TestType));
    TestType *const memory = new TestType[0x8000'0000u]; // 2^31 64-bit words = 16 GiB
    for (int N = 4; N < 32; N++) // reduce the upper bound on machines with less RAM
    {
        const uint32_t gfx = gf[N - 4];
        const uint32_t seg_size = 1u << N; // number of 64-bit words in this block
        const int repCount = 1 + static_cast<int>(gf[25] / static_cast<float>(seg_size)); // keep the total access count roughly constant across block sizes
        fill(&memory[1], &memory[seg_size], 0);
        chrono::high_resolution_clock::time_point timerx(chrono::high_resolution_clock::now());
        for (int rep = 0; rep < repCount; rep++)
        {
            uint32_t start = 1;
            for (uint32_t i = 0; i < seg_size - 1; i++) { // cycles from 1 back to 1 includes all values except 0
                ++memory[start];
                start <<= 1;
                if (start & seg_size)
                    start ^= gfx;
            }
            if (start != 1)
            {
                cout << "ERROR\n";
                exit(-1);
            }
        }
        auto time_done = chrono::duration<double>(chrono::high_resolution_clock::now()-timerx).count();
        auto x = find_if_not(&memory[1], &memory[seg_size], [repCount](auto v) {return v == static_cast<TestType>(repCount); });
        if (x != &memory[seg_size])
        {
            printf("Failed at memory offset %lld\n", x - &memory[0]);
            return -1;
        }
        if ((sizeof(TestType) << N) < 1000)
            printf("%9.0f    %6.2f\n", 1.0*(sizeof(TestType) << N), (seg_size - 1)*repCount / (time_done * 1'000'000));
        else if ((sizeof(TestType) << N) < 1000'000)
            printf("%8.2fk    %6.2f\n", .001*(sizeof(TestType) << N), (seg_size - 1)*repCount / (time_done * 1'000'000));
        else
            printf("%8.2fm    %6.2f\n", .000001*((long long int)sizeof(TestType) << N), (seg_size - 1.)*repCount /(time_done * 1'000'000));
    }
    cout << "Done\n";
    return 0;
}
asked Feb 04 '19 by doug



1 Answer

The throughput continues to decrease because the page-walking time per element increases as the total number of elements increases. In other words, the time spent filling the TLB does not scale linearly with the number of elements. You can observe this using the DTLB_LOAD_MISSES.WALK_DURATION performance counter and other counters related to the page-walking hardware. This is expected: as the number of 4K pages accessed grows, the depth and breadth of the page table structure that maps the working set also grow, so it becomes less likely that the required page table entries are found at memory levels closer to the core.
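One way to test this explanation (a sketch, not something the answer spells out) is to back the buffer with 2 MB large pages, so that each TLB entry covers 512 times more memory and far fewer page walks are needed. On Windows that means VirtualAlloc with MEM_LARGE_PAGES, which requires the "Lock pages in memory" privilege (SeLockMemoryPrivilege) to be granted and enabled for the process:

// Sketch only: allocate the test buffer with large pages and re-run the same
// LFSR loop. If page walks dominate the tail of the curve, throughput at the
// largest block sizes should improve noticeably.
#include <windows.h>
#include <cstdint>
#include <cstdio>

int main()
{
    const size_t bytes   = size_t(1) << 34;        // 16 GiB, as in the question
    const size_t largePg = GetLargePageMinimum();  // typically 2 MB on x86-64
    if (largePg == 0) { printf("Large pages not supported\n"); return -1; }
    const size_t rounded = (bytes + largePg - 1) & ~(largePg - 1);

    uint64_t *memory = static_cast<uint64_t *>(
        VirtualAlloc(nullptr, rounded,
                     MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                     PAGE_READWRITE));
    if (!memory) {   // fails unless the lock-pages privilege is held
        printf("Large-page allocation failed (error %lu)\n", GetLastError());
        return -1;
    }
    // ... run the benchmark loop over 'memory' exactly as in the question ...
    VirtualFree(memory, 0, MEM_RELEASE);
    return 0;
}

The DTLB_LOAD_MISSES.WALK_DURATION counter mentioned above can then be compared between the 4 KB-page and 2 MB-page runs.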

answered Oct 19 '22 by Hadi Brais