I used _rdtsc() to time atoi() and atof(), and I noticed they were taking pretty long. I therefore wrote my own versions of these functions, which were much quicker from the first call.
I am using Windows 7 and the VS2012 IDE, but with the Intel C/C++ compiler v13. I have /O3 enabled and also /Ot ("favour fast code"). My CPU is an Ivy Bridge (mobile).
Upon further investigation, it seemed that the more times atoi() and atof() were called, the quicker they executed. I am talking magnitudes faster:
When I call atoi() from outside my loop, just the once, it takes 5,892 CPU cycles, but after thousands of iterations this reduces to 300-600 CPU cycles (quite a large execution-time range).

atof() initially takes 20,000 to 30,000 CPU cycles and then, after a few thousand iterations, it was taking 18-28 CPU cycles (which is the speed at which my custom function runs the first time it is called).
Could someone please explain this effect?
EDIT: I forgot to say: the basic setup of my program is a loop parsing bytes from a file. Inside the loop I obviously use my atoi and atof, which is where I noticed the above. However, what I also noticed is that when I did my investigation before the loop, just calling atoi and atof twice along with my user-written equivalent functions twice, it seemed to make the loop execute faster. The loop processes 150,000 lines of data, each line requiring 3x atof() or atoi() calls. Once again, I cannot understand why calling these functions before my main loop affects the speed of a program calling these functions 500,000 times?!
#include <ia32intrin.h> // _rdtsc() with the Intel compiler
#include <cstdlib>      // atoi(), atof()
#include <iostream>
using namespace std;

int main(){
    //(Declarations of i, file_size, bytes, message_type, price, qty and
    // the offset arrays elided - they come from the file-parsing setup.)

    //call myatoi() and time it
    //call atoi() and time it
    //call myatoi() and time it
    //call atoi() and time it
    const char* bytes2 = "45632";
    __int64 start2 = _rdtsc();
    unsigned int a2 = atoi(bytes2);
    __int64 finish2 = _rdtsc();
    cout << (finish2 - start2) << " CPU cycles for atoi()" << endl;
    //call myatof() and time it
    //call atof() and time it
    //call myatof() and time it
    //call atof() and time it

    //Iterate through 150,000 lines, each line about 25 characters.
    //The below executes slower if the above debugging is NOT done.
    while(i < file_size){
        //Loop through my data, call atoi() or atof() 1 or 2 times per line
        switch(bytes[i]){
        case ' ':
            //I have an array of shorts which records the distance from the
            //beginning of the line to each of the tokens in the line. In the
            //switch statement below, offset_to_price and offset_to_qty refer
            //to this array.
            break;
        case '\n':
            switch(message_type){
            case 'A': {
                char* temp = bytes + offset_to_price;
                __int64 start = _rdtsc();
                price = atof(temp);
                __int64 finish = _rdtsc();
                cout << (finish - start) << " CPU cycles" << endl;
                //Other processing with the tokens
                break;
            }
            case 'R': {
                //Get the 4th line token using atoi() as above
                char* temp = bytes + offset_to_qty;
                __int64 start = _rdtsc();
                qty = atoi(temp);
                __int64 finish = _rdtsc();
                cout << (finish - start) << " CPU cycles" << endl;
                //Other processing with the tokens
                break;
            }
            }
            break;
        }
        ++i;
    }
}
The lines in the file are like this (with no blank lines in between):
34605792 R dacb 100
34605794 A racb S 44.17 100
34605797 R kacb 100
34605799 A sacb S 44.18 100
34605800 R nacb 100
34605800 A tacb B 44.16 100
34605801 R gacb 100
I am using atoi() on the 4th element in the 'R' messages and the 5th element in the 'A' messages, and using atof() on the 4th element in the 'A' messages.
I'm guessing the reason why you see such a drastic improvement for atoi and atof, but not for your own, simpler function, is that the former have a large number of branches in order to handle all the edge cases. The first few times, this leads to a large number of incorrect branch predictions, which are costly. But after a few times, the predictions get more accurate. A correctly predicted branch is almost free, which would then make them competitive with your simpler version, which doesn't include the branches to begin with.
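For illustration, here is a minimal sketch of the kind of branch-light converter I mean (hypothetical, since you haven't shown your own version). It assumes a well-formed, non-negative decimal string, so it needs essentially one loop branch, whereas the library atoi() must additionally branch on whitespace, signs, bases, locale details and overflow:

// Hypothetical minimal parser: assumes a well-formed, non-negative
// decimal string with no leading whitespace or sign, and does no
// overflow checking - hence far fewer branches than atoi().
unsigned int my_atoi(const char* s){
    unsigned int value = 0;
    while(*s >= '0' && *s <= '9'){
        value = value * 10 + (unsigned int)(*s - '0'); // shift in next digit
        ++s;
    }
    return value;
}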
Caching is surely also important, but I don't think that explains why your own function was fast from the beginning and did not see any relevant improvement after repeated execution (if I understand you correctly).
Using RDTSC for profiling is dangerous. From the Intel processor manual:
The RDTSC instruction is not a serializing instruction. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed. If software requires RDTSC to be executed only after all previous instructions have completed locally, it can either use RDTSCP (if the processor supports that instruction) or execute the sequence LFENCE;RDTSC.
With the inevitable Heisenberg effect that causes, you'll now measure the cost of RDTSCP or LFENCE. Consider measuring a loop instead.
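As a sketch of what the manual suggests (assuming the __rdtscp, _mm_lfence and __rdtsc intrinsics from <intrin.h> on MSVC/Intel compilers; the helper name is mine):

#include <intrin.h> // __rdtscp, _mm_lfence, __rdtsc

// Hypothetical helper: reads the timestamp counter only after all
// preceding instructions have completed locally, per the manual.
unsigned __int64 read_tsc_serialized(){
    unsigned int aux;       // receives IA32_TSC_AUX (e.g. the core id)
    return __rdtscp(&aux);  // alternative: _mm_lfence(); return __rdtsc();
}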
Measuring performance for a single call like this isn't advisable. You'd get too much variance due to power throttling, interrupts and other OS/system interference, measurement overhead, and, as said above, cold/warm variance. On top of that, rdtsc is no longer considered a reliable measurement since your CPU may throttle its own frequency, but for the sake of this simple check we can say it's good enough.
You should run your code at least several thousand times, discard some portion at the beginning, and then divide to get the average. That would give you the "warm" performance, which would include (as mentioned in the comments above) close cache-hit latency for both code and data (and also TLBs), good branch prediction, and might also negate some of the external impacts (such as having only recently woken your CPU from a power-down state).
Of course, you may argue that this performance is too optimistic because in real scenarios you won't always hit the L1 cache, etc. It may still be fine for comparing two different methods (such as competing with the library ato* functions); just don't count on the results for real life. You can also make the test slightly harder and call the function with a more elaborate pattern of inputs that would stress the caches a bit better.
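A minimal sketch of such a measurement harness (the iteration counts, warm-up fraction and input handling are arbitrary choices of mine, not prescriptions):

#include <intrin.h>   // __rdtscp (MSVC/Intel compilers)
#include <cstdlib>    // atoi()
#include <iostream>

int main(int argc, char** argv){
    const char* input = (argc > 1) ? argv[1] : "45632"; // defeat constant folding
    const int total  = 100000;  // total timed iterations (arbitrary)
    const int warmup = 10000;   // discarded cold iterations (arbitrary)
    unsigned int aux;
    unsigned __int64 sum = 0;
    volatile int sink = 0;      // keep the call from being optimized away

    for(int i = 0; i < total; ++i){
        unsigned __int64 start = __rdtscp(&aux);
        sink += std::atoi(input);
        unsigned __int64 finish = __rdtscp(&aux);
        if(i >= warmup)         // discard the warm-up portion
            sum += finish - start;
    }
    std::cout << "average: " << sum / (total - warmup)
              << " cycles per atoi() call" << std::endl;
}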
As for your question regarding the 20k-30k cycles: that's exactly the reason why you should discard the first few iterations. This isn't just cache-miss latency; you're actually waiting for the first instructions to do a code fetch, which may also wait for the code-page translation to do a page walk (a long process that may involve multiple memory accesses), and, if you're really unlucky, swapping in a page from disk, which requires OS assistance and lots of IO latency. And all of this happens before you've even started executing the first instruction.
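This would also explain your observation that calling the functions once before the main loop made the loop faster: those early calls act as a warm-up that pays the code-fetch, page-walk and first-miss costs outside the region you care about. A sketch of the idea (the helper name and inputs are mine):

#include <cstdlib> // atoi(), atof()

// Hypothetical warm-up helper: call each parser once so its code
// pages are faulted in and its caches and branch predictors are
// primed before the timed loop starts.
void warm_up_parsers(){
    volatile int    wi = std::atoi("0");   // touch atoi's code path
    volatile double wf = std::atof("0.0"); // touch atof's code path
    (void)wi; (void)wf;                    // silence unused-variable warnings
}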
The most likely explanation is that because you are calling atoi/atof so often, the code is being identified as a hot spot and is thus kept in the Level 1 or Level 2 processor code cache. The CPU's replacement policy (the logic that determines which cache lines can be evicted when a cache miss occurs) would tend to keep such a hot spot in cache. There's a decent write-up of CPU caching technologies on Wikipedia, if you are interested.

Your initial timings were high because your code wasn't in the CPU's most performant cache yet; once it had been invoked some number of times, it was.