
.sum() method in pandas gives inconsistent results

Tags: python, pandas, sum

I have a large DataFrame (circa 4e+07 rows).

When summing it, I get two significantly different results depending on whether I do the sum before or after the column selection.
Also, the dtype changes from float32 to float64, even though the totals are all below 2**31.

df[[col1, col2, col3]].sum()
Out[1]:
col1         9.36e+07
col2         1.39e+09
col3         6.37e+08
dtype: float32

df.sum()[[col1, col2, col3]]
Out[2]:
col1         1.21e+08
col2         1.70e+09
col3         7.32e+08
dtype: float64

I am obviously missing something; has anybody had the same issue?

Thanks for your help.

asked Jan 29 '23 by Fafon

1 Answer

To understand what's going on here, you need to understand what Pandas is doing under the hood. I'm going to simplify a bit, since there are lots of bells and whistles and special cases to consider, but roughly it looks like this:

Suppose you've got a Pandas DataFrame object df with various numeric columns (we'll ignore datetime columns, categorical columns, and the like). When you compute df.sum(), Pandas:

  1. Extracts the values of the dataframe into a two-dimensional NumPy array.
  2. Applies the NumPy sum function to that 2d array with axis=0 to compute the column sums, roughly as in the sketch below.
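
As a rough sketch (not the exact internal code path, and assuming a modern pandas with the .to_numpy() method), those two steps look something like this:

>>> import numpy as np
>>> values = df.to_numpy()     # step 1: one 2d array for the whole frame
>>> np.sum(values, axis=0)     # step 2: column sums computed by NumPy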

It's the first step that's important here. The columns of a DataFrame might have different dtypes, but a 2d NumPy array can only have a single dtype. If df has a mixture of float32 and int32 columns (for example), Pandas has to choose a single dtype that's appropriate for both columns simultaneously, and in this case it chooses float64. So when the sum is computed, it's computed on double-precision values, using double-precision arithmetic. This is what's happening in your second example.
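
You can see the promotion rule that applies here directly in NumPy (a sketch; the rules Pandas uses have more special cases, but they agree for this combination):

>>> import numpy as np
>>> np.result_type(np.float32, np.int32)
dtype('float64')

float64 is chosen because float32, with its 24-bit significand, can't represent every int32 value exactly, while float64 can.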

On the other hand, if you cut down to just the float32 columns in the first place, then Pandas can and will use the float32 dtype for the 2d NumPy array, and so the sum computation is performed in single precision. This is what's happening in your first example.

Here's a simple example showing this in action: we'll set up a DataFrame with 100 million rows and three columns, of dtypes float32, float32 and int32 respectively. All the values are ones:

>>> import numpy as np, pandas as pd
>>> s = np.ones(10**8, dtype=np.float32)
>>> t = np.ones(10**8, dtype=np.int32)
>>> df = pd.DataFrame(dict(A=s, B=s, C=t))
>>> df.head()
     A    B  C
0  1.0  1.0  1
1  1.0  1.0  1
2  1.0  1.0  1
3  1.0  1.0  1
4  1.0  1.0  1
>>> df.dtypes
A    float32
B    float32
C      int32
dtype: object

Now when we compute the sums directly, Pandas first turns everything into float64s. The computation is also done using the float64 type, for all three columns, and we get an accurate answer.

>>> df.sum()
A    100000000.0
B    100000000.0
C    100000000.0
dtype: float64

But if we first cut down our dataframe to just the float32 columns, then float32 arithmetic is used for the sum, and we get very poor answers.

>>> df[['A', 'B']].sum()
A    16777216.0
B    16777216.0
dtype: float32

The inaccuracy is of course due to using a dtype that doesn't have enough precision for the task in question: at some point in the summation, we end up repeatedly adding 1.0 to 16777216.0, and getting 16777216.0 back each time, thanks to the usual floating-point problems. The solution is to explicitly convert to float64 yourself before doing the computation.
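
With the example DataFrame above, that conversion looks like this (a sketch; note that astype makes a full float64 copy of the data first, which costs extra memory on a frame this size):

>>> df[['A', 'B']].astype(np.float64).sum()
A    100000000.0
B    100000000.0
dtype: float64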

However, this isn't quite the end of the surprises that Pandas has in store for us. With the same dataframe as above, let's try just computing the sum for column "A":

>>> df[['A']].sum()
A    100000000.0
dtype: float32

Suddenly we're getting full accuracy again! So what's going on? This has little to do with dtypes: we're still using float32 to do the summation. It's now the second step (the NumPy summation) that's responsible for the difference. What's happening is that NumPy can, and sometimes does, use a more accurate summation algorithm, called pairwise summation, and with float32 dtype and arrays of the size we're using, that accuracy can make a huge difference to the final result. However, it only uses that algorithm when summing along the fastest-varying axis of an array; see this NumPy issue for related discussion.

In the case where we compute the sum of both column "A" and column "B", we end up with a values array of shape (100000000, 2). The fastest-varying axis is axis 1, and we're computing the sum along axis 0, so the naive summation algorithm is used and we get poor results. But if we only ask for the sum of column "A", we get the accurate result, computed using pairwise summation.
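
The same axis dependence is visible in pure NumPy (a sketch; exactly when NumPy switches to pairwise summation is an implementation detail):

>>> a = np.ones(10**8, dtype=np.float32)
>>> a.sum()  # reduction along the fastest-varying axis: pairwise summation
100000000.0
>>> a.reshape(-1, 2).sum(axis=0)  # reduction along axis 0: naive accumulation
array([16777216., 16777216.], dtype=float32)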

In sum, when working with DataFrames of this size, you want to be careful to (a) work with double precision rather than single precision whenever possible, and (b) be prepared for differences in output results due to NumPy making different algorithm choices.
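
If you want double-precision accumulation without first materialising a float64 copy of the whole frame, one option (a sketch; it drops the column labels) is to pass dtype to the NumPy-level sum:

>>> df[['A', 'B']].to_numpy().sum(axis=0, dtype=np.float64)
array([1.e+08, 1.e+08])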

answered Feb 15 '23 by Mark Dickinson