I've come across a strange problem: I have the following code:
int matches = 0;
for (int str_id = 0; str_id < STR_COUNT; str_id++) {
    if (test(strings1[str_id], strings2[str_id]) == 0)
        matches++;
}
It compares pairs of null-terminated strings using the test() function. strings1 and strings2 are vectors containing STR_COUNT null-terminated strings of the same length.
Depending on whether test() dereferences its arguments, this snippet executes in constant or in linear time with respect to the length of the strings in strings1 and strings2. That is, if I use:
int test(char* a, char* b) {
    return (a != b);
}
then the running time doesn't depend on the length of the strings stored in strings1 and strings2. On the other hand, if I use:
int test(char* a, char* b) {
    return (*a != *b);
}
then the running time increases linearly with the length of the strings stored in strings1 and strings2.
Why could this happen?
EDIT: Full example of the problem here: http://pastebin.com/QTPAkP1g
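In case the link goes stale, here is a minimal sketch of the setup (the allocation scheme, STR_COUNT, and STR_LEN are illustrative assumptions, not the exact pastebin code):

#include <cstring>
#include <vector>

#define STR_COUNT 100000
#define STR_LEN   64        /* vary this to observe the effect */

int test(char* a, char* b) {
    return (*a != *b);      /* or: return (a != b); */
}

int main() {
    /* Pack each set of strings back to back in one buffer, so
       consecutive strings are STR_LEN + 1 bytes apart. */
    char* buf1 = new char[STR_COUNT * (STR_LEN + 1)];
    char* buf2 = new char[STR_COUNT * (STR_LEN + 1)];
    std::vector<char*> strings1(STR_COUNT), strings2(STR_COUNT);
    for (int i = 0; i < STR_COUNT; i++) {
        strings1[i] = buf1 + i * (STR_LEN + 1);
        strings2[i] = buf2 + i * (STR_LEN + 1);
        std::memset(strings1[i], 'x', STR_LEN);
        std::memset(strings2[i], 'x', STR_LEN);
        strings1[i][STR_LEN] = strings2[i][STR_LEN] = '\0';
    }

    int matches = 0;
    for (int str_id = 0; str_id < STR_COUNT; str_id++) {
        if (test(strings1[str_id], strings2[str_id]) == 0)
            matches++;
    }

    delete[] buf1;
    delete[] buf2;
    return matches % 256;   /* use the result so the loop is not dead code */
}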
You are seeing the effects of data locality.
In the case where you are merely comparing pointers, the operation accesses only the memory in the two vectors. Vectors store their elements contiguously, so each memory access is to a location very close to the one accessed during the previous iteration. This is very good locality, and the cache smiles upon you.
In the case where you dereference the pointers, you are adding additional memory accesses to the mix, so the cache has more work to do, and the effect depends a lot on the implementation.
Extrapolating from your data, it appears that the strings are packed together in memory, so the distance from the start of one string to the start of the next string depends on the length of the string. Short strings are packed more tightly together than longer strings.
In particular, you may be able to pack a few very short strings into a single cache line. When that happens, a single cache line can service the memory accesses of multiple iterations. As the strings get longer, fewer of them will fit into a single cache line, so the cache efficiency decreases. Eventually, the strings are long enough that each string occupies a separate cache line, and the cache provides no benefit.
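To put rough numbers on that (assuming a 64-byte cache line, which is typical on x86 but is an assumption about your hardware), this small computation shows how many consecutive strings have their first character in the same line when strings of a given length are packed back to back:

#include <cstdio>

int main() {
    const int cache_line = 64;           /* assumed line size */
    const int lens[] = {3, 7, 15, 31, 63};
    for (int len : lens) {
        /* packed strings are len + 1 bytes apart (including the NUL),
           so this many consecutive first bytes land in one line */
        printf("len %2d: %2d strings per cache line\n",
               len, cache_line / (len + 1));
    }
    return 0;
}

Only the first byte of each string is read by the dereferencing test(), so once the strings are 64 bytes or longer, every iteration touches a fresh cache line.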
Because in the first case it can be proved that, as long as strings1 != strings2, the condition will never be true. An optimizing compiler can just deduce that the whole loop will never have any observable side effect, so it can optimize it to oblivion.
Consider that the address of strings[str_id] is strings + str_id * sizeof(*strings); let's assume that the sizeof is 1 for simplicity (we can do this without loss of generality). Your condition then becomes:
if (test(strings1 + str_id, strings2 + str_id) == 0)
If the compiler is able to inline test, with the first version of test the code would become:
if ((strings1 + str_id != strings2 + str_id) == 0)
or, in successively simplified but equivalent forms:
if (strings1 + str_id == strings2 + str_id)
if (strings1 == strings2)
So, since strings1 != strings2 (which is almost certainly the case), and since the compiler can assume that strings1 and strings2 will not be modified by external causes, it can simply skip the whole loop and do nothing instead. Doing nothing takes constant time.
With the second version of test there is no way to know whether the condition is true without actually executing the loop and dereferencing the pointers on each iteration, so the running time becomes linear.
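One way to check this explanation (compiler behavior varies with version and flags, so treat this as a sketch): compile each version with optimization, e.g. g++ -O2 -S, and compare the assembly; with the pointer-only test the loop typically vanishes. You can also make test opaque so the optimizer cannot reason about it at all, for instance by moving it into a separate translation unit (and not using link-time optimization):

/* test.cpp -- compiled on its own, e.g.: g++ -O2 -c test.cpp */
int test(char* a, char* b) {
    return (a != b);
}

/* main.cpp -- only a declaration is visible here, so the compiler
   cannot prove the loop has no effect and must call test() on
   every iteration: g++ -O2 -c main.cpp && g++ main.o test.o */
extern int test(char* a, char* b);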