I have the following C++ code snippet (the C++ part is the profiler class, which is omitted here), compiled with VS2010 (64-bit Intel machine). The code simply multiplies an array of floats (arr2) by a scalar and puts the result into another array (arr1):
int M = 150, N = 150;
int niter = 20000; // do many iterations to have a significant run-time
float *arr1 = (float *)calloc (M*N, sizeof(float));
float *arr2 = (float *)calloc (M*N, sizeof(float));
// Read data from file into arr2
float scale = float(6.6e-14);
// START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {
        arr1[n] += scale * arr2[n];
    }
}
// END_PROFILING
free(arr1);
free(arr2);
The reading-from-file part and the profiling (i.e. run-time measurement) are omitted here for simplicity.
When arr2 is initialized to random numbers in the range [0, 1], the code runs about 10 times faster than when arr2 is initialized to a sparse array in which about 2/3 of the values are zeros. I have played with the compiler options /fp and /O, which changed the run-time a little, but the 1:10 ratio was approximately kept.
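For reference, here is a minimal timing sketch that could stand in for the omitted profiler (the START_PROFILING / END_PROFILING markers above); it is an assumption about the measurement, not the actual profiler class, and it reuses the variables from the snippet above:

#include <cstdio>
#include <ctime>

// Hypothetical stand-in for the omitted profiler: wrap the timed loop in
// clock() calls and report the elapsed CPU time over all niter iterations.
clock_t t0 = clock();                       // START_PROFILING
for (int iter = 0; iter < niter; ++iter) {
    for (int n = 0; n < M*N; ++n) {
        arr1[n] += scale * arr2[n];
    }
}
clock_t t1 = clock();                       // END_PROFILING
printf("elapsed: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);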
EDIT
Complete code is here: https://gist.github.com/1676742; the command line for compiling is in a comment in test.cpp.
The data files are here:
That's probably because your "fast" data consists only of normal floating-point numbers, while your "slow" data contains lots of denormalized numbers.
As for your second question, you can try to improve speed with this (and treat all denormalized numbers as exact zeros):
#include <xmmintrin.h>
_mm_setcsr(_mm_getcsr() | 0x8040);
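For what it's worth, 0x8040 sets two MXCSR bits: bit 15 (FTZ, flush-to-zero, so denormal results are written as zero) and bit 6 (DAZ, denormals-are-zero, so denormal inputs are read as zero). A sketch of the same setting written with the named intrinsics macros (this assumes a DAZ-capable SSE2+ CPU, which any 64-bit Intel machine is):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Treat denormal results (FTZ) and denormal inputs (DAZ) as exact zeros
// for the current thread; call this once before the timed loop.
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // MXCSR bit 15
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // MXCSR bit 6

Note that this only affects SSE arithmetic, not legacy x87 code; on x64 with VS2010 the compiler generates SSE2 instructions for float math, so it applies here.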
I can think of two reasons for this.
First, the branch predictor may be making incorrect decisions. That is one potential cause of performance gaps when the data changes but the code does not; in this case, however, it seems very unlikely.
The second possible reason is that your "mostly zeros" data doesn't really consist of zeros but of almost-zeros, or that you're keeping arr1 in the almost-zero range. See the Wikipedia article on denormal numbers.
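To make that concrete: scale is 6.6e-14, so any arr2 value below roughly 1.8e-25 already pushes the product under FLT_MIN (about 1.18e-38) and into the denormal range. A small self-contained check along these lines (the 1e-30 input is just an illustrative value, not one taken from your data files):

#include <cfloat>
#include <cstdio>

int main() {
    float scale = 6.6e-14f;
    float x = 1e-30f;        // small, but still a normal float
    float y = scale * x;     // ~6.6e-44: below FLT_MIN (~1.18e-38)
    // A positive value smaller than FLT_MIN (and not zero) is denormal.
    printf("FLT_MIN = %g  y = %g  denormal: %d\n",
           FLT_MIN, y, (y > 0.0f && y < FLT_MIN));
    return 0;
}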
There is nothing strange about the data from I.bin taking longer to process: you have lots of numbers like '1.401e-045#DEN' or '2.214e-043#DEN' (where #DEN means the number cannot be normalized to standard float precision). Given that you are going to multiply them by 6.6e-14, you will definitely get underflows, which significantly slow down the calculations.
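If you prefer not to touch the MXCSR flags, a sketch of an alternative is to flush the denormal inputs to zero once after loading the file (this assumes that treating such tiny values as exact zeros is acceptable for your computation):

#include <cfloat>
#include <cmath>

// Replace denormal entries of arr2 with exact zeros right after reading
// the file, so the inner loop never sees a denormal operand.
for (int n = 0; n < M*N; ++n) {
    if (fabsf(arr2[n]) < FLT_MIN)
        arr2[n] = 0.0f;
}

Note that this only cleans the inputs; a product scale * arr2[n] can still underflow into the denormal range (any arr2 value below about 1.8e-25), which is why setting the FTZ/DAZ bits as suggested above is the more complete fix.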