
accuracy of float32

To reduce the filesize, I'm trying to save float64 data to file in float32. The data values generally range from 1e-12 to 10. I tested the accuracy loss when converting float64 to float32.

print(np.finfo(np.float32))

shows

Machine parameters for float32
---------------------------------------------------------------
precision=  6   resolution= 1.0000000e-06
machep=   -23   eps=        1.1920929e-07
negep =   -24   epsneg=     5.9604645e-08
minexp=  -126   tiny=       1.1754944e-38
maxexp=   128   max=        3.4028235e+38
nexp  =     8   min=        -max
---------------------------------------------------------------

It looks like float32 has a resolution of 1e-6, and absolute values remain representable down to about 1.2e-38.
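As a quick sanity check on how to read these fields (my interpretation, not from the numpy docs verbatim): `eps` is the gap between 1.0 and the next representable float32, so the worst-case *relative* rounding error for round-to-nearest is about `eps / 2`, while `tiny` is the smallest *normal* float32, not the smallest representable value.

```python
import numpy as np

f32 = np.finfo(np.float32)

# eps is the spacing between 1.0 and the next float32 above it,
# i.e. 2**-23 for a 23-bit mantissa.
print(f32.eps)        # ~1.1920929e-07
print(f32.eps / 2)    # ~5.96e-08, the worst-case relative rounding error

# tiny is the smallest *normal* float32; subnormals go lower still.
print(f32.tiny)       # ~1.1754944e-38

# eps is exactly the spacing NumPy reports at 1.0
print(np.spacing(np.float32(1.0)) == f32.eps)
```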

import numpy as np

x = 2.0*np.random.rand(100) - 1.0 # make random numbers in [-1, 1]

print('x.dtype: %s' % x.dtype) # outputs float64

print('number :  max_error       max_relative_error')
for i in range(-40, 1):
    y = x * 10**i
    err = np.max(np.abs(y - y.astype('f4').astype('f8')))
    print('1e%-4d:  %e    %e' % (i, err, err / 10**i)) # normalize by the scale 10**i

The results are

number:    max_error       max_relative_error
1e-40 :    6.915620e-46    6.915620e-06
1e-39 :    6.910361e-46    6.910361e-07
1e-38 :    6.949349e-46    6.949349e-08
1e-37 :    4.816590e-45    4.816590e-08
1e-36 :    4.303771e-44    4.303771e-08
1e-35 :    3.518621e-43    3.518621e-08
1e-34 :    5.165854e-42    5.165854e-08
1e-33 :    3.660088e-41    3.660088e-08
1e-32 :    3.660088e-40    3.660088e-08
1e-31 :    4.097193e-39    4.097193e-08
1e-30 :    4.615068e-38    4.615068e-08
1e-29 :    3.696983e-37    3.696983e-08
1e-28 :    2.999860e-36    2.999860e-08
1e-27 :    4.723454e-35    4.723454e-08
1e-26 :    3.801082e-34    3.801082e-08
1e-25 :    3.062408e-33    3.062408e-08
1e-24 :    4.876378e-32    4.876378e-08
1e-23 :    3.779378e-31    3.779378e-08
1e-22 :    3.144592e-30    3.144592e-08
1e-21 :    4.991049e-29    4.991049e-08
1e-20 :    3.949261e-28    3.949261e-08
1e-19 :    3.002761e-27    3.002761e-08
1e-18 :    5.162480e-26    5.162480e-08
1e-17 :    4.135703e-25    4.135703e-08
1e-16 :    3.282146e-24    3.282146e-08
1e-15 :    4.722129e-23    4.722129e-08
1e-14 :    3.863295e-22    3.863295e-08
1e-13 :    3.375549e-21    3.375549e-08
1e-12 :    4.011790e-20    4.011790e-08
1e-11 :    4.011790e-19    4.011790e-08
1e-10 :    3.392060e-18    3.392060e-08
1e-9  :    5.471206e-17    5.471206e-08
1e-8  :    4.072652e-16    4.072652e-08
1e-7  :    3.496987e-15    3.496987e-08
1e-6  :    5.662626e-14    5.662626e-08
1e-5  :    4.412957e-13    4.412957e-08
1e-4  :    3.482083e-12    3.482083e-08
1e-3  :    5.597344e-11    5.597344e-08
1e-2  :    4.620014e-10    4.620014e-08
1e-1  :    3.540690e-09    3.540690e-08
1e0   :    2.817751e-08    2.817751e-08

The relative error is on the order of 1e-8 for values above 1e-38, which is lower than the 1e-6 resolution reported by np.finfo, and the error is still acceptable even when the value is below np.finfo's tiny value.

It looks safe to save my data in float32, but I'm curious: why does this test seem inconsistent with the figures from np.finfo?
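For the stated use case (values between 1e-12 and 10), a round-trip check along these lines can confirm the conversion is safe; this is just a sketch, the in-memory buffer and tolerance are my own choices:

```python
import io
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(1e-12, 10.0, size=1000)  # float64, in the stated range

# Save as float32 and load it back (an in-memory buffer stands in for a file).
buf = io.BytesIO()
np.save(buf, data.astype(np.float32))
buf.seek(0)
restored = np.load(buf).astype(np.float64)

# All values here are in the normal float32 range, so the relative
# error is bounded by eps/2 ~ 5.96e-8.
rel_err = np.max(np.abs(restored - data) / np.abs(data))
print(rel_err)
```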

ddzzbbwwmm asked Sep 13 '25

1 Answer

Numbers that small are in the subnormal range. The exponent field doesn't have enough range to go that low, so significant bits are gradually sacrificed as values shrink toward zero. This is called "gradual underflow".

https://en.wikipedia.org/wiki/Denormal_number
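The effect is easy to demonstrate (a minimal sketch; the 1e-40 test value matches the first row of the question's table): below `tiny`, float32 values are spaced at a fixed absolute step of 2**-149, so relative precision degrades the smaller the value gets.

```python
import numpy as np

f32 = np.finfo(np.float32)

# Below tiny (~1.18e-38) float32 switches to subnormals with a fixed
# absolute spacing of 2**-149 (~1.4e-45) instead of a relative one.
subnormal_spacing = np.spacing(np.float32(0.0))
print(subnormal_spacing)          # ~1.401298e-45

# At 1e-40 there are only ~71362 spacing units available, so the
# relative conversion error blows up well past eps/2 (~5.96e-8).
x = 1e-40
rel_err = abs(float(np.float32(x)) - x) / x
print(rel_err)                    # on the order of 1e-6 to 1e-5
```

This matches the question's table: the relative error sits near 5e-8 for all normal-range scales but jumps to ~7e-6 at 1e-40.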

recursive answered Sep 16 '25