I want to explain my question through a practical problem I met in my project.
I am writing a C library (which behaves like a programmable vi editor), and I plan to provide a series of APIs (more than 20 in total):
void vi_dw(struct vi *vi);               /* "dw": delete to next word      */
void vi_de(struct vi *vi);               /* "de": delete to end of word    */
void vi_d0(struct vi *vi);               /* "d0": delete to start of line  */
void vi_d_eol(struct vi *vi);            /* "d$" ('$' is not valid in a C identifier) */
...
void vi_df(struct vi *vi, char target);  /* "df<target>": delete up to target */
void vi_dd(struct vi *vi);               /* "dd": delete line              */
These APIs do not perform core operations; they are just wrappers. For example, I can implement vi_de() like this:
void vi_de(struct vi *vi){
    vi_v(vi); /* enter visual mode */
    vi_e(vi); /* press key 'e'     */
    vi_d(vi); /* press key 'd'     */
}
However, if each wrapper is as simple as this, I have to write more than 20 similar wrapper functions. So I considered implementing more complex wrappers to reduce their number:
typedef void (*vi_move_func_t)(struct vi *vi);  /* a plain move operation */

void vi_d_move(struct vi *vi, vi_move_func_t move){
    vi_v(vi);  /* enter visual mode    */
    move(vi);  /* perform the motion   */
    vi_d(vi);  /* delete the selection */
}

static inline void vi_dw(struct vi *vi){
    vi_d_move(vi, vi_w);
}

static inline void vi_de(struct vi *vi){
    vi_d_move(vi, vi_e);
}
...
The function vi_d_move() is a better wrapper: it converts a whole family of similar move operations into APIs. But it does not cover all of them; vi_f(), for example, needs another wrapper with a third argument, char target.
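A second wrapper family for motions that carry a character argument could look like this (a sketch; the names vi_d_move_char and vi_move_char_func_t are illustrative, not settled API):

typedef void (*vi_move_char_func_t)(struct vi *vi, char target);

/* Same visual-select-then-delete pattern, for motions taking a char. */
void vi_d_move_char(struct vi *vi, vi_move_char_func_t move, char target){
    vi_v(vi);          /* enter visual mode         */
    move(vi, target);  /* e.g. vi_f: jump to target */
    vi_d(vi);          /* delete the selection      */
}

static inline void vi_df(struct vi *vi, char target){
    vi_d_move_char(vi, vi_f, target);
}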
I have finished explaining the example picked from my project. The pseudo code above is simpler than the real case, but it is enough to show that: the more complex the wrapper is, the fewer wrappers we need, and the slower they will be (they become more indirect and have to consider more conditions).
There are two extremes:
1. Use only one wrapper, complex enough to accept all move operations and convert them into the corresponding APIs.
2. Use more than twenty small and simple wrappers: one wrapper per API.
For case 1, the wrapper itself is slow, but it has a better chance of staying resident in cache, because it is executed often (all APIs share it). It's a slow but hot path.
For case 2, the wrappers are simple and fast, but each has less chance of being resident in cache. At the very least, the first time any API is called, an instruction cache miss will happen (the CPU has to fetch the instructions from memory, not from L1 or L2).
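To make extreme 1 concrete, the single wrapper could look something like this sketch (the enum and the motion names vi_0/vi_eol are illustrative). The switch is exactly the extra branching mentioned above:

/* Extreme 1: one wrapper dispatching every "delete + motion" API. */
enum vi_motion { VI_MOTION_W, VI_MOTION_E, VI_MOTION_0,
                 VI_MOTION_EOL, VI_MOTION_F };

void vi_d_any(struct vi *vi, enum vi_motion m, char target){
    vi_v(vi);                             /* enter visual mode       */
    switch (m) {                          /* the added branching     */
    case VI_MOTION_W:   vi_w(vi);          break;
    case VI_MOTION_E:   vi_e(vi);          break;
    case VI_MOTION_0:   vi_0(vi);          break;
    case VI_MOTION_EOL: vi_eol(vi);        break;
    case VI_MOTION_F:   vi_f(vi, target);  break;
    }
    vi_d(vi);                             /* delete the selection    */
}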
Currently, I have implemented five wrappers, each of them relatively simple and fast. This seems to be a balance, but it only seems so: I chose five just because I felt the move operations could be divided naturally into five groups. I have no idea how to evaluate this choice. I don't mean with a profiler; I mean, in theory, what main factors should be considered in such a case?
At the end of this post, I want to add more detail about these APIs:
1. These APIs need to be fast, because this library is designed as a high-performance virtual editor: the delete/copy/paste operations are meant to approach the speed of bare C code.
2. A user program based on this library seldom calls all of these APIs, only some of them, and usually no more than 10 times each.
3. In the real case, these simple wrappers are about 80 bytes each, and even merged into a single complex one the result would be no more than 160 bytes (but would contain more if-else branches).
4. As for the situations in which the library is used, I will take lua-shell as an example (a little off-topic, but some friends want to know why I care so much about its performance):
lua-shell is a *nix shell which uses Lua as its scripting language. Its command execution unit (which does the fork()/exec() work) is just a C module registered into the Lua state machine. lua-shell treats everything as Lua.
So, when the user inputs:
local files = `ls -la`
and presses Enter, the input string is first sent to lua-shell's preprocessor, which converts the mixed syntax into pure Lua code:
local files = run_command("ls -la")
run_command() is the entry point of lua-shell's command execution unit, which, as I said before, is a C module.
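For context, registering run_command into the Lua state looks roughly like this (a generic sketch using the standard Lua C API; the body is a placeholder, not the real implementation):

#include <lua.h>
#include <lauxlib.h>

/* Placeholder body: the real unit forks and execs the command. */
static int run_command(lua_State *L){
    const char *cmd = luaL_checkstring(L, 1);  /* e.g. "ls -la"       */
    (void)cmd;                                 /* fork()/exec() here  */
    lua_pushstring(L, "");                     /* placeholder output  */
    return 1;                                  /* one Lua return value */
}

void register_command_unit(lua_State *L){
    lua_register(L, "run_command", run_command);  /* expose it to Lua */
}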
We can talk about libvi now. lua-shell's preprocessor is the first user of the library I am writing. Here is the relevant code (pseudo):
#include"vi.h"
vi_loadstr("local files = `ls -la`");
vi_f(vi, '`');
vi_x(vi);
vi_i(vi, "run_command(\"");
vi_f(vi, '`');
vi_x(vi);
vi_a(" \") ");
The code above is part of lua-shell's preprocessor implementation. After generating the pure Lua code, it feeds the code to the Lua state machine and runs it.
The shell user is sensitive to the time interval between pressing Enter and seeing a new prompt, and in most cases lua-shell needs to preprocess scripts of larger size and with more complicated mixed syntax.
This is a typical situation in which libvi is used.
When a cache miss occurs, the data has to be fetched from the underlying memory hierarchy, which increases the duration of the request; typically the data is then copied into the cache, adding a little more latency, though that cost is offset by later hits on the same data. The more misses pile up, the more time is spent waiting on memory instead of doing useful work.
I would not care that much about cache misses (especially in your case), unless your benchmarks (with compiler optimizations enabled, i.e. compiling with gcc -O2 -mtune=native if using GCC...) indicate that they matter.
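A benchmark can be as simple as this harness (a generic sketch; put whichever libvi call you want to time in the loop — note that with -O2 an empty loop body is optimized away entirely):

#include <stdio.h>
#include <time.h>

int main(void){
    struct timespec t0, t1;
    enum { N = 1000000 };
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        /* call the wrapper under test here, e.g. vi_de(vi);
           an empty body will be removed by the optimizer */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per call\n", ns / N);
    return 0;
}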
If performance matters that much, enable more optimizations (perhaps compiling and linking your entire application or library with gcc -flto -O2 -mtune=native, that is, with link-time optimization), and hand-optimize only what is critical. You should trust your optimizing compiler.
If you are in the design phase, consider perhaps making your application multi-threaded or somehow concurrent and parallel. With care, this could speed it up more than cache optimizations would.
It is unclear what your library is about and what your design goals are. One possibility for adding flexibility might be to embed some interpreter (like Lua or Guile or Python, etc.) in your application, hence configuring it through scripts. In many cases, such an embedding can be fast enough (especially when the application-specific primitives are of a high enough level). Another (more complex) possibility is to provide metaprogramming abilities, perhaps through some JIT-compiling library like libjit or libgccjit (so you would, in a sense, "compile" user scripts into dynamically produced machine code).
BTW, your question seems to focus on instruction cache misses. I would believe that data cache misses are more important (and less optimizable by the compiler), and that is why you would prefer e.g. vectors to linked lists (and, more generally, care about low-level data structures, focusing on sequential, cache-friendly accesses).
(You could find a good video by Herb Sutter which explains that last point; I forgot the reference.)
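To illustrate that last point with a minimal (generic) sketch: summing a contiguous array walks memory sequentially, which the hardware prefetcher handles well, while summing a linked list chases pointers and tends to miss the data cache on every node:

#include <stddef.h>

struct node { int value; struct node *next; };

/* Sequential access: consecutive cache lines, prefetcher-friendly. */
long sum_array(const int *a, size_t n){
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer chasing: every node may live in a different cache line. */
long sum_list(const struct node *head){
    long s = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        s += p->value;
    return s;
}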
In some very specific cases, with recent GCC or Clang, adding a few __builtin_prefetch calls might slightly improve performance (by decreasing cache misses), but it can also harm it significantly, so I don't recommend using it in general; see the GCC documentation on __builtin_prefetch for the details.
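For completeness, such a prefetch typically looks like the sketch below; the look-ahead distance of 8 elements is an arbitrary guess that would have to be tuned per machine, which is exactly why this so often backfires:

#include <stddef.h>

void scale(float *a, size_t n, float k){
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], 1, 1);  /* write hint, low temporal locality */
        a[i] *= k;
    }
}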