
Optimizing ARM cache usage for different arrays

Tags: cpu-cache, arm

I want to port a small piece of code to an ARM Cortex-A8 processor. Both the L1 and L2 caches are very limited. There are three arrays in my program. Two of them are accessed sequentially (array A: 6 MB, array B: 3 MB), and the access pattern for the third (array C: 3 MB) is unpredictable. Although the calculations are not very intensive, accessing array C causes a huge number of cache misses. One solution I thought of would be to allocate more L2 cache space to array C and less to arrays A and B, but I have not been able to find any way to achieve this. I went through the documentation for ARM's preload engine but could not find anything useful.

user285999 asked Mar 04 '10 06:03



1 Answer

It would be a good idea to split the cache and allocate each array in a different part of it.

Unfortunately, that is not possible. The caches of the Cortex-A8 are just not that flexible. The good old StrongARM had a secondary cache for exactly this splitting purpose, but it is not available anymore. We have L1 and L2 caches instead (overall a good change, imho).

However, there is one thing you can do:

The NEON SIMD unit of the Cortex-A8 lags behind the general-purpose processing unit by around 10 processor cycles. With clever programming you can issue cache prefetches from the general-purpose unit but do the actual memory accesses via NEON. The delay between the two pipelines gives the cache a bit of time to complete the prefetches, so your average cache miss penalty will be lower.

The drawback is that you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind, this causes a full CPU pipeline flush, which is almost as costly as a cache miss, if not more so.

The difference in performance can be significant. As a rough guess, I would expect a speed improvement of somewhere between 20% and 30%.
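To make the prefetch-ahead idea concrete, here is a minimal sketch in C. It is not the exact NEON scheme described above: it only shows the part where the general-purpose unit requests a cache line some iterations before the data is used. Everything named here is an assumption for illustration: the function `sum_with_prefetch`, the index array `idx` (standing in for the unpredictable accesses to array C), and the distance `PREFETCH_AHEAD`, which would need tuning on real hardware. It relies on the GCC/Clang builtin `__builtin_prefetch`, which on ARMv7 targets is emitted as a `PLD` instruction. It also assumes the indices are known a few iterations in advance; if each index depends on the previously loaded value, prefetching like this cannot help.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many iterations ahead to prefetch. */
#define PREFETCH_AHEAD 8

/*
 * Sums a[i] + b[i] + c[idx[i]] over n elements.
 * a and b are streamed sequentially; c is read through idx[], which
 * stands in for the unpredictable access pattern (an assumption).
 */
uint32_t sum_with_prefetch(const uint8_t *a, const uint8_t *b,
                           const uint8_t *c, const uint32_t *idx, size_t n)
{
    assert(a && b && c && idx);

    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n) {
            /* Ask for the line of c that will be needed PREFETCH_AHEAD
             * iterations from now (read access, low temporal locality). */
            __builtin_prefetch(&c[idx[i + PREFETCH_AHEAD]], 0, 0);
        }
        /* In the scheme described above, these loads would be done with
         * NEON instructions; plain scalar code is used here so the
         * sketch stays portable. */
        sum += (uint32_t)a[i] + b[i] + c[idx[i]];
    }
    return sum;
}
```

Calling `sum_with_prefetch(a, b, c, idx, n)` returns the same value as the loop without the prefetch; only the timing changes. To stay within the constraint mentioned above, results should be written back to memory rather than moved from NEON registers to ARM registers inside the loop.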

Nils Pipenbrinck answered Oct 18 '22 16:10
