
Pandas Bug?: Mean of a grouped-by int64 column stays as int64 in some circumstances

Tags: python, pandas, csv

I am seeing very strange (IMHO) behaviour with some data loaded into pandas from a CSV file. To protect the innocent, let's say the DataFrame is in the variable homes and has, among others, the columns below:

In [143]: homes[['zipcode', 'sqft', 'price']].dtypes
Out[143]:
zipcode     int64
sqft        int64
price       int64
dtype: object

To get the average price in each zipcode, I tried:

In [146]: homes.groupby('zipcode')[['price']].mean().head(n=5)
Out[146]:
           price
zipcode
28001     280804
28002     234284
28003     294111
28004    1355927
28005     810164

Strangely enough, the price mean is an int64 as shown by:

In [147]: homes.groupby('zipcode')[['price']].mean().dtypes
Out[147]:
price    int64
dtype: object
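
One way to confirm that the int64 result is a truncated value rather than an exact one is to compute the mean from its definition (group sum divided by group count) and compare. The snippet below is only a sketch: the homes data in it is a small made-up stand-in, not the real CSV, and on a pandas version without the problem both dtypes print as float64.

```python
import pandas as pd

# Hypothetical stand-in for the real `homes` data (values are made up).
homes = pd.DataFrame({
    "zipcode": [28001, 28001, 28002, 28002, 28002],
    "price": [280000, 281609, 234000, 234500, 234352],
})

grouped = homes.groupby("zipcode")["price"]

# Mean computed from its definition; true division always yields floats.
manual_mean = grouped.sum() / grouped.count()

# Mean as reported by pandas when a single column is selected.
reported_mean = homes.groupby("zipcode")[["price"]].mean()["price"]

print(manual_mean.dtype)
print(reported_mean.dtype)

# Largest absolute difference; a nonzero value means the reported mean lost its fraction.
print((manual_mean - reported_mean).abs().max())
```

If the difference is nonzero, the fractional part was dropped somewhere inside mean() and not in the data itself.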

I cannot think of any technical reason why the mean of some ints would not be promoted to float. What is more, just adding another column to the selection makes price come out as float64, as I expected all along:

In [148]: homes.groupby('zipcode')[['price', 'sqft']].mean().dtypes
Out[148]:
price       float64
sqft        float64
dtype: object

The first rows of that two-column result keep their fractional part:

                  price          sqft
zipcode
28001     280804.690608  14937.450276
28002     234284.035176   7517.633166
28003     294111.278571  10603.096429
28004    1355927.097792  13104.220820
28005     810164.880952  19928.785714
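
A possible workaround, until the cause is known, is to cast the column to float64 before grouping so that the aggregation never starts from an integer dtype. This is a sketch under assumptions: the data is a made-up stand-in for homes, and it has not been checked against pandas 0.16.2 specifically.

```python
import pandas as pd

# Hypothetical stand-in for the real `homes` data (values are made up).
homes = pd.DataFrame({
    "zipcode": [28001, 28001, 28002, 28002],
    "sqft": [15000, 14875, 7500, 7535],
    "price": [280000, 281609, 234000, 234568],
})

# Cast only the column being averaged; the original frame is left untouched.
as_float = homes.assign(price=homes["price"].astype("float64"))

mean_price = as_float.groupby("zipcode")[["price"]].mean()

print(mean_price.dtypes)
print(mean_price)
```

Using assign returns a new DataFrame, so the int64 price column in homes stays available for anything else that needs it.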

To make sure I was not missing something obvious, I created another, very simple DataFrame (df), but this behaviour does not appear with it:

In [161]: df[['J','K']].dtypes
Out[161]:
J    int64
K    int64
dtype: object

In [164]: df[['J','K']].head(n=10)
Out[164]:
   J   K
0  0  -9
1  0 -14
2  0   8
3  0 -11
4  0  -7
5 -1   7
6  0   2
7  0   0
8  0   5
9  0   3

In [165]: df.groupby('J')[['K']].mean()
Out[165]:
           K
J
-2 -2.333333
-1  0.466667
 0 -1.030303
 1 -1.750000
 2 -3.000000

Note that here, with a single column K (int64) grouped by J (also int64), the mean comes out directly as a float. The homes DataFrame was read from a supplied CSV file; df was created in pandas, written to a CSV file and then read back.
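
The round trip described above (create in pandas, write to CSV, read back, group) can be sketched as follows. The values are made up and only loosely resemble df, and an in-memory buffer stands in for the CSV file on disk.

```python
import io

import pandas as pd

# Small made-up frame with two integer columns, similar in shape to df.
df = pd.DataFrame({
    "J": [0, 0, 0, -1, -1, 1],
    "K": [-9, -14, 8, 7, 2, 3],
})

# Write to an in-memory CSV and read it back (stand-in for a real file).
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

print(df_back.dtypes)

# Single-column group-by mean, the same shape of call as in the question.
result = df_back.groupby("J")[["K"]].mean()
print(result.dtypes)
print(result)
```

Running the same script against the real CSV, and printing pd.__version__ alongside, would make the report easy for others to reproduce.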

Last but not least, I am using pandas 0.16.2.

c-garcia asked Oct 31 '22 18:10


1 Answer

As suggested by some of you in the comments, this is a bug in pandas. I have just reported it here.

As of now, it has been accepted by the pandas team.

Thanks

c-garcia answered Nov 02 '22 23:11