Bit fields: Set vs test-and-set (for performance)

Question

I have a large number of instances of a C structure like this:

struct mystruct
{
    /* ... */
    unsigned flag: 1;
    /* ... */
};

flag is initially 0 but must be 1 on exit from a certain function.

The simplest implementation is:

void set_flag(struct mystruct *sp)
{
    sp->flag = 1U;
}

But what is the likely effect on performance of doing this instead:

void set_flag(struct mystruct *sp)
{
    if (sp->flag == 0U)
    {
        sp->flag = 1U;
    }
}

I am hoping to avoid a write to main memory. The first version always does the write, the second version only performs the write if the flag was not already set, but in the vast majority of cases, the flag will already be set.

What other factors (e.g. branch prediction) are likely to affect performance?

I have seen a small speed increase so far, which I hope will become more significant as the data set becomes larger.

Is there a risk of this change making the program slower for large data sets, and if so in what circumstances might this happen?

Nils Pipenbrinck · Accepted Answer

The test before set does make a difference, but how much it is depends on your use-cases.

The data will end up in a cache-line in either case (e.g. just writing or test-and-set).

However, there is a difference if your cache line is tagged as dirty (e.g. modified) or clean. Dirty cache-lines have to be written back to main memory while clean cache-lines can just be forgotten and filled with new data.

Now consider that your code mangles huge amounts of data, and you access each chunk of data only once or twice. If so can assume that most of the memory accesses are cache misses. What happens if the majority of your cache-lines are dirty at the point where a cache miss occurs and the majority of cache lines are dirty?

They have to be written back to the main memory before new data will be loaded into the line. This is slower than just forgetting the content of a cache-line. Also it will double the memory bandwidth between the cache and the main memory.

That may not make a difference for once CPU core since memory is fast these days, but another CPU will (hopefully) do some other work as well. You can be sure that the other CPU core will execute everything a tad faster if the buss is not busy moving cache-lines in and out.

In short: keeping your cache-lines clean will half that bandwidth requirement and makes cache-misses a tad cheaper.

Regarding the branch: Sure: It's costly, but a cache-miss is much worse! Also if you're lucky the CPU will use it's out of order execution features to offset cache misses with the costs of the branch.

If you really want to get the best possible performance out of this code, and if most of your accesses are cache-misses you have two options:

Bypass the cache: The x86 architecture has non temporal loads and stores for this purpose. They're hidden somewhere in the SSE instruction sets and can be used from the c-language via intrinsics.
(Only for experts): Use some lines of inline-assembler that replaces the test-and-set function with assembler that uses the CMOV (conditional move) instruction. This will not only keep your cache-lines clean but avoid the branch. Now CMOV is a slow instruction and will only outperform a branch if the branches cannot be predicted. So you'll better benchmark your code.

Nate Kohl · Answer

This is an interesting question, and Nils' answer about cache lines is definitely great advice.

I'd like to emphasize the importance of profiling code to measure real performance -- can you measure how frequently that flag will already be set in the data that you encounter? Performance could change a lot depending on the answer.

Just for fun, I used your code to run a little comparison of set versus test-then-set on a 50-million element array filled with various proportions of 1's. Here's a graph:

comparison of set vs. test-then-set
_{(source: natekohl.net)}

This is just a toy example, of course. But note the non-linear performance -- which I wasn't expecting -- and that test-then-set becomes faster than plain set when the array is almost entirely filled with 1's.

nik · Answer

These are my interpretations of your requirement,

you have the flag initialized separately
it is set only once (to 1) and not reset after that
But, this set attempt will be made many times on the same flag
And, you have a lot these flag instances (each needing the same kind of processing)

Assuming that,

space optimization is weighted quite lower than time optimization,

I suggest the following things.

Firstly, on 32-bit systems it helps to use 32-bit integers if you are worried about access times
If you skip a check on the flag 'word', the write will be quite fast. But, given that you have a very large number of flags which you will keep checking and set if not already set, it would be better to keep the conditional check in.
But, having said that, if your platform does parallel operations, (for example, a write to the disk can be sent off in parallel to your code execution usually) it would be worthwhile skipping the check.

Bit fields: Set vs test-and-set (for performance)

Tags:

c

optimization

bit-fields

finnw

3 Answers

Nils Pipenbrinck

Nate Kohl

nik

Recent Activity

Donate For Us

Bit fields: Set vs test-and-set (for performance)

Tags:

c

optimization

bit-fields

finnw

3 Answers

Nils Pipenbrinck

Nate Kohl

nik

Related questions

Recent Activity

Donate For Us