How to put my structure variable into CPU caches to eliminate main memory page access time? Options

Tags:

It's clear that there is no explicit way or certain system calls that help programmers to put a variable into the CPU cache.

But I think that a certain programming style or well designed algorithm can make it possible to increase the possibilities that the variable can be cached into the CPU caches.

Here is my example:

I want to append an 8 byte structure at the end of an array consisting of the same type of structures, declared in the global main memory region.

This process is continuously repeated for 4 million operations. This process takes 6 seconds, 1.5 us for each operation. I think this result tells that the two memory areas have not been cached.

I got some clues from a cache-oblivious algorithm, so I tried several ways to enhance this. Until now, no enhancement.

I think some clever codes can reduce the elapsed time, up to 10 to 100 times. Please show me the way.

-------------------------------------------------------------------------

Appended (2011-04-01)

Damon~ thank you for your comment!

After reading your comment, I analyzed my code again, and found several things that I missed. The following code that I attached is the abbreviated version of my original code.

To accurately measure each operation's execution time (in the original code, there are several different types of operations), I inserted the time measuring code using clock_gettime() function. I thought if I measure each operation's execution time and accumulate them, the additional cost by the main loop can be avoided.

In the original code, the time measuring code was hidden by a macro function, so I totally forgot about it.

The running time of this code is almost 6 seconds. But if I get rid of the time measuring function in the main loop, it becomes 0.1 seconds.

Since the clock_gettime() function supports very high precision (upto 1 nano second), executed on the basis of an independent thread, and also it requires very big structure, I think the function caused the cache-out of the main memory area where the consecutive insertions are performed.

Thank you again for your comment. For further enhancement, any suggestion will be very helpful for me to optimize my code.

I think the hierachically defined structure variable might cause unnecessary time cost, but first I want to know how much it would be, before I change it to the more C-style code.

typedef struct t_ptr {
    uint32 isleaf :1, isNextLeaf :1, ptr :30;
    t_ptr(void) {
        isleaf = false;
        isNextLeaf = false;
        ptr = NIL;
    }
} PTR;

typedef struct t_key {
    uint32 op :1, key :31;
    t_key(void) {
        op = OP_INS;
        key = 0;
    }
} KEY;

typedef struct t_key_pair {
    KEY key;
    PTR ptr;
    t_key_pair() {
    }

    t_key_pair(KEY k, PTR p) {
        key = k;
        ptr = p;
    }
} KeyPair;

typedef struct t_op {
    KeyPair keyPair;
    uint seq;
    t_op() {
        seq = 0;
    }
} OP;

#define MAX_OP_LEN 4000000
typedef struct t_opq {
    OP ops[MAX_OP_LEN];
    int freeOffset;
    int globalSeq;
    bool queueOp(register KeyPair keyPair);
} OpQueue;

bool OpQueue::queueOp(register KeyPair keyPair) {
    bool isFull = false;
    if (freeOffset == (int) (MAX_OP_LEN - 1)) {
        isFull = true;
    }
    ops[freeOffset].keyPair = keyPair;
    ops[freeOffset].seq = globalSeq++;
    freeOffset++;
}

OpQueue opQueue;
#include <sys/time.h>
int main() {
    struct timespec startTime, endTime, totalTime;
    for(int i = 0; i < 4000000; i++) {
        clock_gettime(CLOCK_REALTIME, &startTime);
        opQueue.queueOp(KeyPair());
        clock_gettime(CLOCK_REALTIME, &endTime);
        totalTime.tv_sec += (endTime.tv_sec - startTime.tv_sec);
        totalTime.tv_nsec += (endTime.tv_nsec - startTime.tv_nsec);
    }
    printf("\n elapsed time: %ld", totalTime.tv_sec * 1000000LL + totalTime.tv_nsec / 1000L);
}

500

asked Mar 31 '11 13:03

Nate

2 Answers

YOU don't put the structure into any cache. The CPU does that automatically for you. The CPU is even more clever than that; if you access sequential memory, it will start putting things from memory into the cache before you read them.

And really, it should be common sense that for a simple bit of code like this, the time you spend on measuring is ten times more than the time to perform the code (apparently 60 times in your case).

Since you put so much confidence in clock_gettime (): I suggest you call it five times in a row and store the results, then print the differences. There's resolution, there's precision, and there's how long it takes to return the current time, which is pretty damned long.

137

answered Nov 12 '22 23:11

gnasher729

I have been unable to force caching, but you can force memory to be uncache-able. If you have large other datastructures you might exclude these so that they will not pollute your caches. This can be done by specifying PAGE_NOCACHE for the Windows VirutalAllocXXX functions.

http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786(v=vs.85).aspx

answered Nov 12 '22 23:11

IvoTops

Related questions
                            
                                gcc's __builtin_memcpy performance with certain number of bytes is terrible compared to clang's
                            
                                Fastest way to run recurrent neural network (inference) on mobile device
                            
                                'Share' or 'cache' an expression parameterized by only ambiguous types?
                            
                                Faster alternatives for Electron's IPC module
                            
                                Specify the webpack "mainFields" on a case by case basis
                            
                                Optimizing the backward solve for a sparse lower triangular linear system
                            
                                Maximum average distance between two numbers across multiple arrays
                            
                                Code runs 30% slower in .Net Core compared to .Net Framework - any way to speed things up?
                            
                                Generating all permutations efficiently
                            
                                Efficient elementwise argmin of matrix-vector difference
                            
                                SwiftUI PageTabView in iOS 14.2 performance issues lagging
                            
                                Performance counters on the iPhone
                            
                                How to solve slow scrolling in UITableView
                            
                                Sparse dot product in SQL
                            
                                Same query, different execution plans
                            
                                Performance of C# method polymorphism with generics
                            
                                Is there an MS Test Runner that is faster
                            
                                Spring MVC on GAE: Slow Load Time
                            
                                A good article on modern CPU features/performance optimizations?
                            
                                BerkeleyDB write performance problems

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to put my structure variable into CPU caches to eliminate main memory page access time? Options

Tags:

performance

memory

caching

cpu

Nate

People also ask

2 Answers

gnasher729

IvoTops

Recent Activity

Donate For Us