I'm interested in minimising the size of a protobuf message serialised from Python. Protobuf has floats (4 bytes) and doubles (8 bytes), while Python's float type is actually a C double, at least in CPython. My question is: given an instance of a Python float, is there a "fast" way of checking if the value would lose precision if it were assigned to a protobuf float (or really a C++ float)?
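To illustrate the kind of precision loss I mean (a quick sketch using struct to emulate a C float, not a proposed solution):

import struct

# Python floats are C doubles; packing with 'f' narrows to a C float.
x = 0.1
print(struct.unpack('f', struct.pack('f', x))[0] == x)      # False: bits were lost
print(struct.unpack('f', struct.pack('f', 0.5))[0] == 0.5)  # True: 0.5 fits exactly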
You can check by converting the float to its hex representation; the sign, exponent and fraction each get a separate section. Provided the fraction uses only the first 6 hex digits (the remaining 7 digits must be zero), and the 6th digit is even (so the last bit is not set), your 64-bit double will fit in a 32-bit single. In addition, the exponent must lie between -126 and 127:
import math
import re

def is_single_precision(
        f,
        _isfinite=math.isfinite,
        _singlepat=re.compile(
            r'-?0x[01]\.[0-9a-f]{5}[02468ace]0{7}p'
            r'(?:\+(?:1[01]\d|12[0-7]|[1-9]\d|\d)|'
            r'-(?:1[01]\d|12[0-6]|[1-9]\d|\d))$').match):
    # Non-finite values (inf, nan) and zero always fit; everything else must
    # use at most 23 fraction bits and an exponent in the normal float32 range.
    return not _isfinite(f) or _singlepat(f.hex()) is not None or f == 0.0
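A quick sanity check, with some values of my own:

>>> is_single_precision(0.5)               # 0x1.0000000000000p-1: fits exactly
True
>>> is_single_precision(1.2345678901e+26)  # needs more than 23 fraction bits
False
>>> is_single_precision(float('inf'))      # non-finite values always pass
True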
The float.hex() method is quite fast, faster than round-tripping via struct or numpy; you can create 1 million hex representations in under half a second:
>>> timeit.Timer('(1.2345678901e+26).hex()').autorange()
(1000000, 0.47934128501219675)
The regex engine is also pretty fast, and with name lookups optimised in the function above we can test 1 million float values in about 1.1 seconds:
>>> import random, sys
>>> testvalues = [0.0, float('inf'), float('-inf'), float('nan')] + [random.uniform(sys.float_info.min, sys.float_info.max) for _ in range(2 * 10 ** 6)]
>>> timeit.Timer('is_single_precision(f())', 'from __main__ import is_single_precision, testvalues; f = iter(testvalues).__next__').autorange()
(1000000, 1.1044921400025487)
The above works because the binary32 format allots 23 bits to the fraction and 8 bits to the exponent (stored with a bias, giving normal values an effective exponent range of -126 to 127). The regex only allows the first 23 bits of the fraction to be set, and only accepts exponents within that range.
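To make that bit budget concrete, here is one way (my own illustration, not part of the test above) to pull the three binary32 fields apart:

import struct

# binary32 layout: 1 sign bit, 8 exponent bits (bias 127), 23 fraction bits.
bits = struct.unpack('>I', struct.pack('>f', 1.5))[0]
sign = bits >> 31
exponent = ((bits >> 23) & 0xff) - 127
fraction = bits & 0x7fffff
print(sign, exponent, hex(fraction))  # 0 0 0x400000 (the single set bit of 1.5)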
This may not be what you want, however! Take for example 1/3 or 1/10. Both are values that require approximation in floating point, and both fail the test:
>>> (1/3).hex()
'0x1.5555555555555p-2'
>>> (1/10).hex()
'0x1.999999999999ap-4'
You may have to take a heuristic approach instead: if your hex value has anything other than zeros past the first 6 digits of the fraction, or an exponent outside of the [-126, 127] range, treat the conversion to single precision as losing too much information.
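One possible shape for such a heuristic (a sketch of my own, not from this answer; close_enough and the rel_tol threshold are illustrative choices):

import struct

def close_enough(x, rel_tol=2.0 ** -20, _s=struct.Struct('f')):
    # Accept the narrowing to single precision if the round-tripped value
    # stays within a relative tolerance of the original.
    try:
        return abs(_s.unpack(_s.pack(x))[0] - x) <= rel_tol * abs(x)
    except OverflowError:  # magnitude exceeds the float32 range
        return False

With these settings close_enough(1/3) returns True: the round trip is inexact, but only by a relative error of about 3e-8.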
For completeness, here is the "round-tripping through struct" method mentioned in the comments, which has the benefit of not requiring numpy but still giving accurate results:

import math
import struct

def is_single_precision_struct(x, _s=struct.Struct("f")):
    # NaN is never equal to itself, so test it separately; packing a value
    # whose magnitude exceeds the float32 range raises OverflowError.
    try:
        return math.isnan(x) or _s.unpack(_s.pack(x))[0] == x
    except OverflowError:
        return False
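A few spot checks (values of my own choosing):

>>> is_single_precision_struct(0.5)
True
>>> is_single_precision_struct(1/3)
False
>>> is_single_precision_struct(1e300)  # far outside the float32 range
False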
A time comparison against is_single_precision_numpy() (defined below) shows the struct round trip coming out faster on my machine as well.
If you want a simple solution that covers almost all corner cases, and that will correctly detect out-of-range exponents as well as loss of information from the smaller precision, you can use NumPy to convert your candidate float to an np.float32 object and then compare with the original:
import numpy as np

def is_single_precision_numpy(floatval, _float32=np.float32):
    # float32() rounds to the nearest single-precision value; the equality
    # holds only if no information was lost in the conversion.
    return _float32(floatval) == floatval
This automatically takes care of potentially problematic cases like values in the float32 subnormal range. For example:
>>> is_single_precision_numpy(float.fromhex('0x13p-149'))
True
>>> is_single_precision_numpy(float.fromhex('0x13.8p-149'))
False
Those cases are harder to handle with the hex-based solution.
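For example, is_single_precision from the first answer normalises the subnormal away and then rejects the exponent:

>>> float.fromhex('0x13p-149').hex()  # normalised form has exponent -145
'0x1.3000000000000p-145'
>>> is_single_precision(float.fromhex('0x13p-149'))  # a representable float32 subnormal
False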
While not as fast as @Martijn Pieters' regex-based solution, the speed is still respectable (about half as fast). Here are timings, where is_single_precision_re_hex is exactly the version from Martijn's answer:
>>> timeit.Timer('is_single_precision_numpy(f)', 'f = 1.2345678901e+26; from __main__ import is_single_precision_numpy').repeat(3, 10**6)
[2.035495020012604, 2.0115931580075994, 2.013475093001034]
>>> timeit.Timer('is_single_precision_re_hex(f)', 'f = 1.2345678901e+26; from __main__ import is_single_precision_re_hex').repeat(3, 10**6)
[1.1169273109990172, 1.1178153319924604, 1.1184561859990936]
Unfortunately, while almost all corner cases (subnormals, infinities, signed zeros, overflows, etc.) are handled correctly, there's one corner case that this solution won't work for: when floatval is a NaN. In that case, is_single_precision_numpy will return False. That may or may not matter for your needs. If it does matter, adding an extra isnan check should do the trick:
import math
import numpy as np

def is_single_precision_numpy(floatval, _float32=np.float32, _isnan=math.isnan):
    # NaN compares unequal to everything, including itself, so check for it
    # explicitly rather than relying on the float32 comparison.
    return _float32(floatval) == floatval or _isnan(floatval)
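A quick check of the fixed version, probing the float32 subnormal boundary and NaN (my own test values):

>>> is_single_precision_numpy(float('nan'))
True
>>> is_single_precision_numpy(2.0 ** -149)  # smallest positive float32 subnormal
True
>>> is_single_precision_numpy(2.0 ** -150)  # rounds to zero in float32
False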