I have an array of timestamps, increasing row by row, in the 2nd column of matrix X. When I calculate the mean of the timestamps it comes out larger than the max value, which should be impossible. I'm using a numpy memmap for storage. Why is this happening?
>>> self.X[:,1]
memmap([ 1.45160858e+09,  1.45160858e+09,  1.45160858e+09, ...,
         1.45997146e+09,  1.45997683e+09,  1.45997939e+09], dtype=float32)
>>> np.mean(self.X[:,1])
1.4642646e+09
>>> np.max(self.X[:,1])
memmap(1459979392.0, dtype=float32)
>>> np.average(self.X[:,1])
1.4642646e+09
>>> self.X[:,1].shape
(873608,)
>>> np.sum(self.X[:,1])
memmap(1279193195216896.0, dtype=float32)
>>> np.sum(self.X[:,1]) / self.X[:,1].shape[0]
memmap(1464264515.9120522)
EDIT: I have uploaded the memmap file here: http://www.filedropper.com/x_2. This is how I load it:
filepath = ...
shape = (875422, 23)
X = np.memmap(filepath, dtype="float32", mode="r", shape=shape)
# I preprocess X by removing rows with all 0s
# note this step doesn't affect the problem
to_remove = np.where(np.all(X == 0, axis=1))[0]
X = np.delete(X, to_remove, axis=0)
This is not a numpy or memmap issue. The issue is floating point, float32 to be precise; you can see the same error happen in other languages like C++. The float32 accumulator used for the sum loses precision as more and more numbers are added to it.
In [26]: a = np.ones((1024,1024), dtype=np.float32)*4567
In [27]: a.min()
Out[27]: 4567.0
In [28]: a.max()
Out[28]: 4567.0
In [29]: a.mean()
Out[29]: 4596.5264
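The magnitudes in the question make the loss of precision easy to see. A rough sketch (the 1.279e15 figure is the np.sum output from the question; the spacing values are standard float32/float64 properties):

import numpy as np

# Near the final sum (~1.279e15), adjacent float32 values are roughly 1.3e8 apart,
# so each ~1.46e9 timestamp added to the accumulator can be rounded off by tens of
# millions, and those errors pile up over ~870k additions. In float64 the spacing
# at the same magnitude is a fraction of a unit.
print(np.spacing(np.float32(1.279e15)))  # ~1.34e+08
print(np.spacing(np.float64(1.279e15)))  # 0.25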
This won't happen with the np.float64 type (which gives some more breathing room).
In [30]: a = np.ones((1024,1024), dtype=np.float64)*4567
In [31]: a.min()
Out[31]: 4567.0
In [32]: a.mean()
Out[32]: 4567.0
You can make mean() use a float64 buffer by specifying it explicitly:
In [12]: a = np.ones((1024,1024), dtype=np.float32)*4567
In [13]: a.mean(dtype=np.float64)
Out[13]: 4567.0
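Applied to the memmap from the question, that would look like the sketch below (assuming X is loaded as shown above; I don't have the file, so no output is shown):

import numpy as np

# Keep the data as float32 on disk but accumulate the reduction in float64.
mean_ts = np.mean(X[:, 1], dtype=np.float64)

# Equivalent alternative: cast the column up front (this copies it into memory as float64).
mean_ts_alt = X[:, 1].astype(np.float64).mean()

Either way the result will fall between the column's min and max, unlike the float32-accumulated value.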