 

Is every element access in std::vector a cache miss?

It's known that std::vector holds its data on the heap, so the vector object itself and its first element have different addresses. On the other hand, std::array is a lightweight wrapper around a raw array, and its own address is equal to the address of its first element.
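A quick illustrative sketch that prints the relevant addresses (the container sizes here are arbitrary):

#include <array>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> vec(100);
    std::array<int, 100> arr{};

    // The vector object (typically on the stack) and its elements
    // (on the heap) live at unrelated addresses.
    std::cout << "&vec:       " << static_cast<const void*>(&vec) << '\n'
              << "vec.data(): " << static_cast<const void*>(vec.data()) << '\n';

    // The array object and its first element share the same address.
    std::cout << "&arr:       " << static_cast<const void*>(&arr) << '\n'
              << "arr.data(): " << static_cast<const void*>(arr.data()) << '\n';
}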

Let's assume the collections are big enough to fill the L1 cache with int32 values. On my machine, with a 384 kB L1 cache, that's 98304 numbers.

If I iterate over the std::vector, it seems I always access the address of the vector object first and then the element's address, and these two addresses are probably not in the same cache line. So every element access would be a cache miss.

But if I iterate over the std::array, the addresses are in the same cache line, so it should be faster.

I tested with VS2013 with full optimizations, and std::array is approximately 20% faster.
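A minimal timing sketch in this spirit might look like the following; the sum loop, the single-shot timing, and std::chrono are illustrative choices here, not the exact benchmark behind the 20% figure:

#include <array>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

constexpr std::size_t N = 98304; // ~384 kB of int32, matching the figure above

template <typename Container>
long long sum(const Container& c)
{
    long long s = 0;
    for (std::size_t i = 0; i < c.size(); ++i)
    {
        s += c[i];
    }
    return s;
}

int main()
{
    std::vector<int> vec(N, 1);
    static std::array<int, N> arr{}; // static: too large to place on the stack safely
    arr.fill(1);

    auto t0 = std::chrono::steady_clock::now();
    volatile long long vecSum = sum(vec); // volatile keeps the result from being optimized away
    auto t1 = std::chrono::steady_clock::now();
    volatile long long arrSum = sum(arr);
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::cout << "vector: " << std::chrono::duration_cast<us>(t1 - t0).count() << " us\n"
              << "array:  " << std::chrono::duration_cast<us>(t2 - t1).count() << " us\n";
}

Note that a single run like this is noisy; repeating the measurement many times gives a more stable picture.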

Am I right in my assumptions?

Update (to avoid creating a second, very similar topic): in this code I have an array and a local variable:

#include <array>
#include <cstddef>

void test(std::array<int, 10>& arr)
{
    int m{ 42 };

    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

In the loop I'm accessing both the array and a stack variable, which are placed far from each other in memory. Does that mean that on every iteration I access two different memory regions and miss the cache?

asked Jul 21 '16 by nikitablack




2 Answers

Many of the things you've said are correct, but I do not believe that you're seeing cache misses at the rate that you believe you are. Rather, I think you're seeing other effects of compiler optimizations.

You are right that when you look up an element in a std::vector, there are two memory reads: first, a read of the pointer to the elements; second, a read of the element itself. However, if you do multiple sequential reads on the std::vector, then chances are that the very first read will be a cache miss on the elements, while all successive reads will either hit the cache or be unavoidable misses. Memory caches are optimized for locality of reference, so whenever a single address is pulled into the cache, a large number of adjacent memory addresses are pulled in as well. As a result, if you iterate over the elements of a std::vector, most of the time you won't have any cache misses at all, and the performance should look quite similar to that of a regular array.

It's also worth remembering that the cache stores many different memory locations, not just one, so the fact that you're reading both something on the stack (the std::vector's internal pointer) and something on the heap (the elements), or two different objects on the stack, won't by itself cause a cache miss.
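To make the locality argument concrete, assume a 64-byte cache line (typical for x86, but an assumption here): sequential int32 reads then touch a new line only once every 16 elements. The sketch below simply counts how many distinct lines one pass over a vector touches:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr std::size_t kLineSize = 64; // assumed cache-line size in bytes
    std::vector<int> vec(98304, 1);

    std::size_t linesTouched = 0;
    std::uintptr_t lastLine = ~std::uintptr_t{ 0 };

    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        // Which cache line does this element live on?
        auto line = reinterpret_cast<std::uintptr_t>(&vec[i]) / kLineSize;
        if (line != lastLine) // a new line means at most one potential miss
        {
            ++linesTouched;
            lastLine = line;
        }
    }

    // 98304 ints at 16 ints per 64-byte line = 6144 lines, far fewer than 98304 accesses.
    std::cout << "elements: " << vec.size()
              << ", distinct cache lines: " << linesTouched << '\n';
}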

Something to keep in mind is that cache misses are extremely expensive compared to cache hits - often 10x slower - so if you were indeed seeing a cache miss on each element of the std::vector you wouldn't see a gap of only 20% in performance. You'd see something a lot closer to a 2x or greater performance gap.

So why, then, are you seeing a difference in performance? One big factor that you haven't yet accounted for is that if you use a std::array<int, 10>, the compiler can tell at compile time that the array has exactly ten elements and can unroll or otherwise optimize your loop to eliminate unnecessary checks. In fact, the compiler could in principle replace the loop with ten sequential blocks of code that each write to a specific array element, which might be a lot faster than repeatedly branching backwards in the loop. On the other hand, with equivalent code that uses std::vector, the compiler can't always know in advance how many times the loop will run, so chances are it can't generate code that's as good as what it generated for the array.
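As a conceptual illustration of that unrolling (a sketch of what the optimizer can do, not the actual code VS2013 emits):

#include <array>
#include <cstddef>

// The original loop: the trip count (10) is a compile-time constant.
void test(std::array<int, 10>& arr)
{
    int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// Roughly what full unrolling produces: no index variable, no branch,
// and each store goes to a statically known offset.
void test_unrolled(std::array<int, 10>& arr)
{
    arr[0] = 0;   arr[1] = 42;  arr[2] = 84;  arr[3] = 126; arr[4] = 168;
    arr[5] = 210; arr[6] = 252; arr[7] = 294; arr[8] = 336; arr[9] = 378;
}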

Then there's the fact that the code you've written here is so small that any attempt to time it is going to have a ton of noise. It would be difficult to assess how fast this is reliably, since something as simple as just putting it into a for loop would mess up the cache behavior compared to a "cold" run of the method.

Overall, I wouldn't attribute this to cache misses, since I doubt there's any appreciably different number of them. Rather, I think this is compiler optimization on arrays whose sizes are known statically compared with optimization on std::vectors whose sizes can only be known dynamically.

answered by templatetypedef


I think it has nothing to do with cache misses.

You can think of std::array as a wrapper around a raw array, i.e. int arr[10], while vector is a wrapper around a dynamic array, i.e. new int[10]. They should have the same performance. However, when you access a vector, you operate on the dynamic array through a pointer. Normally the compiler can optimize code that uses arrays better than code that uses pointers, and that might be the reason for your test result: std::array is faster.

You can run a test that replaces std::array with int arr[10]. Although std::array is just a wrapper around int arr[10], you might get even better performance (in some cases the compiler can do better optimization with a raw array). You can also run a test that replaces the vector with new int[10]; they should have equal performance.
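If you want to run that comparison, a minimal sketch of the three variants might look like this (the function names are made up for illustration):

#include <array>
#include <cstddef>
#include <vector>

// 1) Raw automatic array: size and storage are fully known to the compiler.
void fill_raw(int (&arr)[10], int m)
{
    for (int i = 0; i < 10; ++i)
    {
        arr[i] = i * m;
    }
}

// 2) std::array: a thin wrapper around the raw array with the same layout.
void fill_std_array(std::array<int, 10>& arr, int m)
{
    for (std::size_t i = 0; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// 3) Heap storage reached through a pointer, which is what std::vector does internally.
void fill_vector(std::vector<int>& vec, int m)
{
    for (std::size_t i = 0; i < vec.size(); ++i)
    {
        vec[i] = static_cast<int>(i) * m;
    }
}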

For your second question: the local variable m will be kept in a register (if the code is optimized properly), so there will be no access to the memory location of m during the for loop. So it won't be a source of cache misses either.
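One way to convince yourself of this (an illustrative experiment, not something from the original test): mark m as volatile, which forces a real memory read on every iteration, and compare the generated assembly with the plain version, where m stays in a register:

#include <array>
#include <cstddef>

// Plain version: with optimizations on, m typically lives in a register
// for the whole loop, so the loop body never touches m's stack slot.
void test(std::array<int, 10>& arr)
{
    int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}

// volatile forces a genuine load of m from memory on every iteration,
// which is the behaviour the question was worried about.
void test_volatile(std::array<int, 10>& arr)
{
    volatile int m{ 42 };
    for (std::size_t i{ 0 }; i < arr.size(); ++i)
    {
        arr[i] = static_cast<int>(i) * m;
    }
}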

answered by for_stack