when should we care about cache missing?

Tags:

I want to explain my question through a practical problem I met in my project.

I am writing a c library( which behaves like a programmable vi editor), and i plan to provide a series of APIs ( more than 20 in total ):

void vi_dw(struct vi *vi);
void vi_de(struct vi *vi);
void vi_d0(struct vi *vi);
void vi_d$(struct vi *vi);
...
void vi_df(struct vi *, char target);
void vi_dd(struct vi *vi);

These APIs do not perform core operations, they are just wrappers. For example, I can implement vi_de() like this:

void vi_de(struct vi *vi){
    vi_v(vi);  //enter visual mode
    vi_e(vi);  //press key 'e'
    vi_d(vi);  //press key 'd'
}

However, if the wrapper is as simple as such, I have to write more than 20 similar wrapper functions.
So, I consider implementing more complex wrappers to reduce the amount:

void vi_d_move(struct vi *vi, vi_move_func_t move){
   vi_v(vi);
   move(vi);
   vi_d(vi);
}
static inline void vi_dw(struct vi *vi){
    vi_d_move(vi, vi_w);
}
static inline void vi_de(struct vi *vi){
    vi_d_move(vi, vi_e);
}
...

The function vi_d_move() is a better wrapper function, he can convert a part of similar move operation to APIs, but not all, like vi_f(), which need another wrapper with a third argument char target .

I finished explaining the example picked from my project.
The pseudo code above is simper than real case, but is enough to show that:
The more complex the wrapper is, the less wrappers we need, and the slower they will be.(they will become more indirect or need to consider more conditions).

There are two extremes:

use only one wrapper but complex enough to adopt all move operations and convert them into corresponding APIs.
use more than twenty small and simple wrappers. one wrapper is one API.

For case 1, the wrapper itself is slow, but it has more chance resident in cache, because it is often executed(all APIs share it). It's a slow but hot path.

For case 2, these wrappers are simple and fast, but has less chance resident in cache. At least, for any API first time called, a cache miss will happen.(CPU need to fetch instructions from memory, but not L1, L2).

Currently, I implemented five wrappers, each of them are relatively simple and fast. this seems to be a balance, but just seems. I chose five just because I felt the move operation can be divided into five groups naturally. I have no idea how to evaluate it, I don't mean a profiler, I mean, in theory, what main factors should be considered in such case?

In the post end, I want to add more detail for these APIs:

These APIs need to be fast. Because this library is designed as a high performance virtual editor. The delete/copy/paste operation is designed to approach the bare C code.
A user program based on this library seldom calls all these APIs, only parts of them, and usually no more than 10 times for each.
In real case, the size of these simple wrappers are about 80 bytes each, and will be no more than 160 bytes even merged into a single complex one. (but will introduce more if-else branches).

4, As with the situation the library is used, I will take lua-shell as example(a little off-topic, but some friends want to know why I so care its performance):

lua-shell is a *nix shell which uses lua as its script. Its command execution unit(which do forks(), execute()..) is just a C module registered into the lua state machine.

Lua-shell treats everything as lua .

So, When user input:

local files = `ls -la`

And press Enter. The string input is first sent to lua-shell's preprocessor————which convert mixed-syntax to pure lua code:

local file = run_command("ls -la")

run_command() is the entry of lua-shell's command execution unit, which, I said before, is a C module.

We can talk about libvi now. lua-shell's preprocessor is the first user of the library I am writing. Here is its relative codes(pseudo):

#include"vi.h"
vi_loadstr("local files = `ls -la`");
vi_f(vi, '`');
vi_x(vi);
vi_i(vi, "run_command(\"");
vi_f(vi, '`');
vi_x(vi);
vi_a(" \") ");

The code above is parts of luashell's preprocessor implementation. After generating the pure lua code, he feeds it to Lua State Machine and run it.

The shell user is sensitive to the time interval between Enter and a new prompt, and in most case lua-shell needs preprocess script with larger size and more complicate mixed-syntax.

This is a typical situation where libvi is used.

532

asked Jul 05 '17 09:07

weiweishuo

1 Answers

I won't care that much about cache misses (especially in your case), unless your benchmarks (with compiler optimizations enabled, i.e. compile with gcc -O2 -mtune=native if using GCC....) indicate that they matter.

If performances matters that much, enable more optimizations (perhaps compiling and linking your entire application or library with gcc -flto -O2 -mtune=native that is with link-time optimizations), and hand-optimize only what is critical. You should trust your optimizing compiler.

If you are in the design phase, consider perhaps making your application multi-threaded or somehow concurrent and parallel. With care, this could speedup it more than cache optimizations.

It is unclear what your library is about and what are your design goals. A possibility to add flexibility might be embed some interpreter (like lua or guile or python, etc...) in your application, hence configuring it thru scripts. In many cases, such an embedding could be fast enough (especially when the application specific primitives are of high enough level). Another (more complex) possibility is to provide metaprogramming abilities perhaps thru some JIT compiling library like libjit or libgccjit (so you would sort-of "compile" user scripts into dynamically produced machine code).

BTW, your question seems to focus on instruction cache misses. I would believe that data cache misses are more important (and less optimizable by the compiler), and that is why you would prefer e.g. vectors to linked lists (and more generally care about low-level data structures, focusing on using sequential -or cache-friendly- accesses)

^{(you could find a good video by Herb Sutter which explains that last point; I forgot the reference)}

In some very specific cases, with recent GCC or Clang, adding a few __builtin_prefetch might slightly improve performance (by decreasing cache misses), but it could also harm it significantly, so I don't recommend using it in general, but see this.

answered Sep 25 '22 21:09

Basile Starynkevitch

Related questions
                            
                                Short circuit behavior of logical expressions in C in this example
                            
                                kernel module no debugging symbols found
                            
                                XCB get events of all windows X.Org
                            
                                S_IFMT and S_IFREG undefined with -std=c11 or -std=gnu11
                            
                                What's the purpose of unnamed bit field at the end of structure
                            
                                Why is struct addrinfo defined only if __USE_XOPEN2K is defined?
                            
                                How does the expression -2[array] work? [duplicate]
                            
                                In C, how to print out a txt file line by line?
                            
                                Macro Expansion: Argument with Commas
                            
                                Using Py_BuildValue() to create a list of tuples in C
                            
                                Confusion about sizeof operator in C
                            
                                How do I "tell" to C compiler that the code shouldn't be optimized out?
                            
                                What is the use of ignoring `SIGCHLD` signal with `sigaction(2)`?
                            
                                API call to get current heap size of process?
                            
                                Parse HTTP Request Line In C
                            
                                Convert 32 bit unix timestamp to std::string using std::chrono
                            
                                What is the preferred way to write GTK-applications?
                            
                                Is the write() function in C blocking or non-blocking?
                            
                                why use usleep and not sleep [closed]
                            
                                How to wrap existing function in C

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

when should we care about cache missing?

Tags:

c

vim

caching

weiweishuo

People also ask

1 Answers

Basile Starynkevitch

Recent Activity

Donate For Us