Sometimes I get NaN as the result of multiplying two non-NaN values, b and c:
double a = b * c; //b = 0, c = 1024, a = nan
or as a result of floor():
double a = floor(b); //b = 2024, a = nan
Repeating the calculation, or calling sleep() before it, avoids the issue:
a = b * c; //a = nan
a = b * c; //a = 0
a = floor(b); //a = nan
a = floor(b); //a = 2024
sleep(1);
a = b * c; //a = 0
sleep(1);
a = floor(b); //a = 2024
CPU is AMD Athlon(tm) 64 X2 Dual Core Processor 3400+
CPU temp:
k8temp-pci-00c3
Adapter: PCI adapter
Core0 Temp: -1°C
Core0 Temp: -2°C
Core1 Temp: +3°C
Core1 Temp: +7°C
Adapter: SMBus PIIX4 adapter at 0b00
M/B Temp: +30°C (low = +0°C, high = +85°C)
CPU Temp: +28.5°C (low = +0.0°C, high = +85.0°C)
M/B Crit: +85°C (hyst = +75°C)
CPU Crit: +124°C (hyst = +114°C)
Could this issue be caused by CPU timing features? Or are there other possible causes?
UPDATE
I found that the following program produces NaN on that machine:
double a, b, c;
while (1) {
    a = 0;
    b = 1024;
    c = a * b; // c becomes NaN within 10-20 seconds
}
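A bounded version of the loop above makes the failure easier to count and log. This is only a sketch: the function name `count_nan_products` and its parameters are made up for illustration, and `volatile` is an assumption added so the compiler cannot fold `0 * 1024` into a constant at build time.

```c
#include <math.h>
#include <stdio.h>

/* Repeats the multiplication `iterations` times and returns how many
   results were NaN. `volatile` forces a real load, multiply and store
   on every pass instead of a compile-time constant. */
long count_nan_products(double a_in, double b_in, long iterations)
{
    volatile double a = a_in;
    volatile double b = b_in;
    volatile double c;
    long nan_count = 0;

    for (long i = 0; i < iterations; ++i) {
        c = a * b;
        if (isnan(c)) {
            ++nan_count;
            fprintf(stderr, "iteration %ld: %g * %g produced NaN\n",
                    i, a, b);
        }
    }
    return nan_count;
}
```

On healthy hardware, `count_nan_products(0.0, 1024.0, n)` should return 0 for any `n`. A nonzero count from finite inputs points at something outside the arithmetic itself.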
NaN (Not a Number) is the result of a numeric expression that cannot produce a valid numeric result. Consider log(−2) and log(−3). Both expressions return NaN, since there is no real number that is the log of a negative number.
If your data contains values that cause a NaN to be computed while calculating the mean, you'll receive NaN. For example, take the mean of five values: 1, Inf, -Inf, 2, and 3. The mean first sums the elements, and adding Inf and -Inf together yields NaN.
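For comparison with the case in the question, here is a small C sketch of a legitimate way NaN arises from finite-looking code: a plain arithmetic mean over data containing Inf and -Inf. The helper names `mean` and `mean_skip_nan` are hypothetical, and the second one is only an illustration of ignoring NaN elements, not any particular library's behaviour.

```c
#include <math.h>
#include <stddef.h>

/* Plain arithmetic mean: sum every element, then divide by n.
   If the data holds both Inf and -Inf, the running sum becomes NaN. */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum / (double)n;
}

/* Mean over the non-NaN elements only. Returns NaN if every element
   is NaN (or n is 0), because 0.0 / 0.0 is NaN. */
double mean_skip_nan(const double *x, size_t n)
{
    double sum = 0.0;
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!isnan(x[i])) {
            sum += x[i];
            ++kept;
        }
    }
    return sum / (double)kept;
}
```

Neither situation applies to `0 * 1024` or `floor(2024)`: both operands are finite and the exact result is representable, so IEEE 754 arithmetic should never return NaN there.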
Any chance you have a stack or memory overwrite occurring elsewhere in the program, such as bad thread handling or a badly handled mutex? Adding a sleep to "fix" the problem makes me think it could be a concurrency issue. If possible, debug the values and see whether they are changed on the fly from other locations, using a write-to-memory breakpoint or perhaps just some printfs (which might change the timing of the problem and hide it as well).
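One way to do the printf-style check is to log the raw bit pattern of each operand next to its value, so an overwritten variable shows up even when `%g` prints something plausible. A minimal sketch, assuming `double` is a 64-bit IEEE 754 value; the helper names `double_bits` and `log_operand` are invented for this example.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t),
               "this sketch assumes a 64-bit double");

/* Copies the bytes of a double into an integer so they can be printed
   or compared exactly. memcpy avoids aliasing problems. */
uint64_t double_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Logs a named operand as both a decimal value and a hex bit pattern. */
void log_operand(const char *name, double x)
{
    fprintf(stderr, "%s = %g (bits 0x%016" PRIx64 ")\n",
            name, x, double_bits(x));
}
```

Calling `log_operand("b", b)` and `log_operand("c", c)` just before `a = b * c` shows whether the inputs still hold what was assigned. For instance, 1024.0 should always print as `0x4090000000000000`. As noted above, the extra I/O can shift the timing enough to hide the problem.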