 

Is every element access in std::vector a cache miss?

It's known that std::vector holds its data on the heap, so the vector object itself and its first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array, and its own address is equal to the address of its first element.
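A quick illustrative sketch that prints the relevant addresses (the container sizes here are arbitrary):

#include <array>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> vec(100);
    std::array<int, 100> arr{};

    // The vector object (typically on the stack) and its elements
    // (on the heap) live at unrelated addresses.
    std::cout << "&vec:       " << static_cast<const void*>(&vec) << '\n'
              << "vec.data(): " << static_cast<const void*>(vec.data()) << '\n';

    // The array object and its first element share the same address.
    std::cout << "&arr:       " << static_cast<const void*>(&arr) << '\n'
              << "arr.data(): " << static_cast<const void*>(arr.data()) << '\n';
}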

Let's assume the collections are big enough to fill the L1 cache with int32 values. On my machine, with a 384 kB L1 cache, that's 98304 numbers.

If I iterate over the std::vector, it seems I always access the address of the vector object first and then the element's address, and these two addresses are probably not in the same cache line. So every element access would be a cache miss.

But if I iterate over the std::array, the addresses are in the same cache line, so it should be faster.

I tested with VS2013 with full optimizations, and std::array is approximately 20% faster.
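A minimal timing sketch in this spirit might look like the following; the sum loop, the single-shot timing, and std::chrono are illustrative choices here, not the exact benchmark behind the 20% figure:

#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t N = 98304; // ~384 kB of int32, matching the figure above

template <typename Container>
long long sum(const Container& c)
{
    long long s = 0;
    for (std::size_t i = 0; i < c.size(); ++i)
    {
        s += c[i];
    }
    return s;
}

int main()
{
    std::vector<int> vec(N, 1);
    static std::array<int, N> arr{}; // static: too large to place on the stack safely
    arr.fill(1);

    auto t0 = std::chrono::steady_clock::now();
    volatile long long vecSum = sum(vec); // volatile keeps the result from being optimized away
    auto t1 = std::chrono::steady_clock::now();
    volatile long long arrSum = sum(arr);
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::cout << "vector: " << std::chrono::duration_cast<us>(t1 - t0).count() << " us\n"
              << "array:  " << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
}

Note that a single run like this is noisy; repeating the measurement many times gives a more stable picture.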

Am I right in my assumptions?

Update (to avoid creating a second, very similar topic): in this code I have an array and a local variable:

#include <array>
#include <cstddef>

void test(std::array<int, 10>& arr)
{
    int m{ 42 };

    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

In the loop I'm accessing both the array and a stack variable, which are placed far from each other in memory. Does that mean that on every iteration I access two different memory regions and miss the cache?

asked Jul 21 '16 by nikitablack




2 Answers

Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.

You are right that when you look up an element in a std::vector, there are two memory reads: first, a read of the pointer to the elements; second, a read of the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that the very first read will be a cache miss on the elements, while all successive reads will either hit the cache or be unavoidable misses. Memory caches are optimized for locality of reference, so whenever a single address is pulled into the cache, a large number of adjacent memory addresses are pulled in as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all, and the performance should look quite similar to that of a regular array.

It's also worth remembering that the cache stores many different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector's internal pointer) and something on the heap (the elements), or two different objects on the stack, won't by itself cause a cache miss.
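To make the locality argument concrete, assume a 64-byte cache line (typical for x86, but an assumption here): sequential int32 reads then touch a new line only once every 16 elements. The sketch below simply counts how many distinct lines one pass over a vector touches:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::size_t kLineSize = 64; // assumed cache-line size in bytes
    std::vector<int> vec(98304, 1);

    std::size_t linesTouched = 0;
    std::uintptr_t lastLine = ~std::uintptr_t{ 0 };

    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        // Which cache line does this element live on?
        auto line = reinterpret_cast<std::uintptr_t>(&vec[i]) / kLineSize;
        if (line != lastLine) // a new line means at most one potential miss
        {
            ++linesTouched;
            lastLine = line;
        }
    }

    // 98304 ints at 16 ints per 64-byte line = 6144 lines, far fewer than 98304 accesses.
    std::cout << "elements: " << vec.size()
              << ", distinct cache lines: " << linesTouched << '\n';
}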

Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.

So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, the compiler can tell at compile time that the array has exactly ten elements and can unroll or otherwise optimize your loop to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with ten sequential blocks of code that each write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as what it generated for the array.
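As a conceptual illustration of that unrolling (a sketch of what the optimizer can do, not the actual code VS2013 emits):

#include <array>
#include <cstddef>

// The original loop: the trip count (10) is a compile-time constant.
void test(std::array<int, 10>& arr)
{
    int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// Roughly what full unrolling produces: no index variable, no branch,
// and each store goes to a statically known offset.
void test_unrolled(std::array<int, 10>& arr)
{
    arr[0] = 0;   arr[1] = 42;  arr[2] = 84;  arr[3] = 126; arr[4] = 168;
    arr[5] = 210; arr[6] = 252; arr[7] = 294; arr[8] = 336; arr[9] = 378;
}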

Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.

Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.

answered by templatetypedef


I think it has nothing to do with cache misses.

You can think of std::array as a wrapper around a raw array, i.e. int arr[10], while vector is a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access a vector, you operate on the dynamic array through a pointer. Normally the compiler can optimize code that uses arrays better than code that uses pointers, and that might be the reason for your test result: std::array is faster.

You can run a test that replaces std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can do better optimization with a raw array). You can also run a test that replaces the vector with new int[10]; they should have equal performance.
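If you want to run that comparison, a minimal sketch of the three variants might look like this (the function names are made up for illustration):

#include <array>
#include <cstddef>
#include <vector>

// 1) Raw automatic array: size and storage are fully known to the compiler.
void fill_raw(int (&arr)[10], int m)
{
    for (int i = 0; i < 10; ++i)
    {
        arr[i] = i * m;
    }
}

// 2) std::array: a thin wrapper around the raw array with the same layout.
void fill_std_array(std::array<int, 10>& arr, int m)
{
    for (std::size_t i = 0; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// 3) Heap storage reached through a pointer, which is what std::vector does internally.
void fill_vector(std::vector<int>& vec, int m)
{
    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        vec[i] = static_cast<int>(i) * m;
    }
}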

For your second question: the local variable m will be kept in a register (if the code is optimized properly), so there will be no access to the memory location of m during the for loop. So it won't be a source of cache misses either.
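One way to convince yourself of this (an illustrative experiment, not something from the original test): mark m as volatile, which forces a real memory read on every iteration, and compare the generated assembly with the plain version, where m stays in a register:

#include <array>
#include <cstddef>

// Plain version: with optimizations on, m typically lives in a register
// for the whole loop, so the loop body never touches m's stack slot.
void test(std::array<int, 10>& arr)
{
    int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// volatile forces a genuine load of m from memory on every iteration,
// which is the behaviour the question was worried about.
void test_volatile(std::array<int, 10>& arr)
{
    volatile int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}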

answered by for_stack