Let me start with the example code:
import numpy
from pandas import DataFrame
a = DataFrame({"nums": [2233, -23160, -43608]})
a.nums = numpy.int64(a.nums)
print(a.nums ** 2)
print((a.nums ** 2).sum())
On my local machine, and other devs' machines, this works as expected and prints out:
0 4986289
1 536385600
2 1901657664
Name: nums, dtype: int64
2443029553
However, on our production server, we get:
0 4986289
1 536385600
2 1901657664
Name: nums, dtype: int64
-1851937743
That is what 32-bit integer overflow would produce, even though the dtype is int64.
The production server is using the same versions of Python, NumPy, pandas, etc. It runs 64-bit Windows Server 2012, and everything reports 64-bit (e.g. python --version, sys.maxsize, platform.architecture).
What could possibly be causing this?
This is a bug in the bottleneck library, which Pandas uses if it's installed. In some circumstances, bottleneck.nansum incorrectly has 32-bit overflow behavior when called on 64-bit input.
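To check that the production number really is the correct sum wrapped to 32 bits, here is a small sketch in plain Python. The helper name wrap_int32 is made up for illustration; it only mimics two's-complement wraparound and is not part of bottleneck or Pandas.

```python
def wrap_int32(value):
    """Reduce an integer to the signed 32-bit range (two's-complement wraparound)."""
    return (value + 2**31) % 2**32 - 2**31


nums = [2233, -23160, -43608]
squares = [n * n for n in nums]

true_total = sum(squares)                # Python ints never overflow
wrapped_total = wrap_int32(true_total)   # what a 32-bit accumulator would hold

print(true_total)     # 2443029553
print(wrapped_total)  # -1851937743
```

The wrapped value matches the production output exactly, and each individual square still fits in 32 bits, which is consistent with only the sum going wrong.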
I believe this is due to bottleneck using PyInt_FromLong even when long is 32-bit. I'm not sure why that even compiles, actually. There's an issue report on the bottleneck issue tracker, not yet fixed, as well as an issue report on the Pandas issue tracker, where they tried to compensate for Bottleneck's issue (but I think they turned off Bottleneck when it does work instead of when it doesn't).
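Until a fix lands, two possible workarounds are sketched below. Both are assumptions on my part rather than something taken from the issue reports: the first sums the underlying NumPy array directly so the Pandas reduction is not involved, and the second relies on the compute.use_bottleneck option, which exists in reasonably recent Pandas releases but may not be available in the version on your server.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"nums": np.array([2233, -23160, -43608], dtype=np.int64)})
squares = a.nums ** 2

# Workaround 1: sum the raw int64 array with NumPy, bypassing the Pandas reduction.
numpy_total = int(np.sum(squares.values, dtype=np.int64))

# Workaround 2: ask Pandas not to use bottleneck for this computation.
# Assumes a Pandas version that provides the "compute.use_bottleneck" option.
with pd.option_context("compute.use_bottleneck", False):
    pandas_total = int(squares.sum())

print(numpy_total)
print(pandas_total)
```

Uninstalling bottleneck from the affected environment should have the same effect as the second workaround, since Pandas only uses it when it is installed.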