I need to store a massive numpy vector to disk. Right now the vector that I am trying to store is ~2.4 billion elements long and the data is float64. This takes about 18GB of space when serialized to disk.
If I use struct.pack() with float32 (4 bytes) I can reduce it to ~9GB. I don't need anywhere near this amount of precision, and disk space is quickly going to become an issue, as I expect the number of values I need to store could grow by an order of magnitude or two.
I was thinking that if I could access the first 4 significant digits, I could store those values in an int and use only 1 or 2 bytes of space. However, I have no idea how to do this efficiently. Does anyone have any ideas or suggestions?
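For context, here is roughly the float32 approach I described, written with numpy's astype/tofile rather than struct.pack (which is far too slow element-by-element at this scale); the array size and filename are just placeholders:

import numpy as np

data = np.random.random(10_000_000)  # stand-in for the real ~2.4e9-element float64 vector

# Cast to float32 and dump the raw bytes: 4 bytes per element instead of 8.
data.astype(np.float32).tofile("vector_f32.bin")

# Reading it back: .tofile stores no metadata, so the dtype must be given explicitly.
restored = np.fromfile("vector_f32.bin", dtype=np.float32)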
A single-precision float occupies 4 bytes, consisting of a sign bit, an 8-bit excess-127 binary exponent, and a 23-bit mantissa. The mantissa represents a number between 1.0 and 2.0. Since the high-order bit of the mantissa is always 1, it is not stored in the number.
Yes, it is typically 4 bytes, but that size is not guaranteed for a C float by the standard; numpy's float32, however, is always exactly 4 bytes.
Scalars of type float are stored using four bytes (32 bits). The format follows the IEEE-754 standard. The mantissa represents the actual binary digits of the floating-point number.
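To make that layout concrete, here is a small sketch that extracts the three fields from a float32's raw bits (the example value 6.25 is arbitrary):

import struct

# Reinterpret the 4 bytes of a float32 as an unsigned 32-bit integer.
(bits,) = struct.unpack("<I", struct.pack("<f", 6.25))

sign     = bits >> 31            # 1 sign bit
exponent = (bits >> 23) & 0xFF   # 8-bit excess-127 exponent
mantissa = bits & 0x7FFFFF       # 23 stored mantissa bits; the leading 1 is implicit

# 6.25 = 1.5625 * 2**2, so exponent - 127 == 2 and 1 + mantissa/2**23 == 1.5625
print(sign, exponent - 127, 1 + mantissa / 2**23)  # -> 0 2 1.5625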
If your data is between 0 and 1 and 16 bits of precision are enough, you can save the data as uint16 (with numpy imported as np):

data16 = (65535 * data).round().astype(np.uint16)

and expand the data back with

data = data16 / 65535.0
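A self-contained round trip of that scheme, assuming the values really are in [0, 1] (array and file names are illustrative); the worst-case reconstruction error is half a quantization step, about 7.6e-6:

import numpy as np

data = np.random.random(1_000_000)  # values in [0, 1)

# Quantize: map [0, 1] onto the 65536 levels of a uint16.
data16 = (65535 * data).round().astype(np.uint16)
data16.tofile("vector_u16.bin")  # 2 bytes per element on disk

# Dequantize after loading.
restored = np.fromfile("vector_u16.bin", dtype=np.uint16) / 65535.0

# Maximum error is 0.5 / 65535, roughly 7.6e-6.
print(np.abs(data - restored).max())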