
Inconsistency in results of aggregating pandas groupby object using numpy.median vs other functions

Using the following DataFrame (pandas imported as pd, numpy as np):

test = pd.DataFrame({'A' : [10,11,12,13,15,25,43,70],  
                     'B' : [1,2,3,4,5,6,7,8],  
                     'C' : [1,1,1,1,2,2,2,2]})


In [39]: test
Out[39]: 
    A  B  C
0  10  1  1
1  11  2  1
2  12  3  1
3  13  4  1
4  15  5  2
5  25  6  2
6  43  7  2
7  70  8  2

Grouping DF by 'C' and aggregating with np.mean (also sum, min, max) produces column-wise aggregation within groups:

In [40]: test_g = test.groupby('C')

In [41]: test_g.aggregate(np.mean)
Out[41]: 
       A    B
C            
1  11.50  2.5
2  38.25  6.5

However, it looks like aggregating using np.median produces DataFrame-wise aggregation within groups:

In [42]: test_g.aggregate(np.median)
Out[42]: 
      A     B
C            
1   7.0   7.0
2  11.5  11.5

(using groupby.median method seems to produce expected column-wise results though)

I would appreciate answers to the following questions:

  1. What is the reason/mechanism of such an outcome?
  2. If this behaviour is confirmed, how does it affect recommended "best practices" of aggregating groupings? Could other aggregation functions work this way?
LukaszJ asked Sep 29 '12 09:09



1 Answer

The reason is quite funny. Probably some pandas specialists would want to chime in, but it comes down to a ping-pong between numpy and pandas. Note that the documentation says:

Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If pass a dict, the keys must be DataFrame column names

In the first case the function receives a 2D array-like (the whole group as a DataFrame); in the second case, DataFrame.apply hands it 1D array-likes, one column at a time.

This means aggregate first passes each group in as a 2D DataFrame. In the first case (np.mean), numpy sees that the object has a .mean method, so it does what it always does and calls it. However, it calls it with axis=None (numpy's default). This makes pandas throw an exception (it wants axis to be 0 or 1, never None), so aggregate moves on to the second step, which passes the columns in as 1D and is foolproof.
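The delegation step can be watched in isolation. This is only a sketch: `Recorder` is a hypothetical stand-in I made up, not a pandas class, and its `mean` raising on `axis=None` merely mimics the pandas behaviour described above. It records how numpy talks to the object.

```python
import numpy as np

class Recorder:
    """Hypothetical stand-in for a group's DataFrame; logs how numpy calls it."""

    def __init__(self, values):
        self.values = np.asarray(values, dtype=float)
        self.calls = []

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # np.mean delegates here for non-ndarray objects that have a .mean method.
        self.calls.append(("mean", axis))
        if axis not in (0, 1):
            # Assumption: mimic the "axis must be 0 or 1" complaint described above.
            raise ValueError("axis must be 0 or 1")
        return self.values.mean(axis=axis)

    def __array__(self, dtype=None, copy=None):
        # np.median has no method to delegate to, so it converts the object to an array.
        self.calls.append(("__array__", None))
        return self.values if dtype is None else self.values.astype(dtype)

# Columns A and B of group C == 1 from the question.
group1 = Recorder([[10, 1], [11, 2], [12, 3], [13, 4]])

try:
    np.mean(group1)          # delegates to group1.mean(axis=None, ...)
    mean_raised = False
except ValueError:
    mean_raised = True       # this is where aggregate would fall back to the 1D path

flat_median = float(np.median(group1))  # no delegation: converted, flattened, reduced

print(group1.calls[0], mean_raised, flat_median)
```

The first recorded call is `("mean", None)`, which is the `axis=None` hand-off, while `np.median` never touches the object's methods and only asks for an array.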

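However, when you pass in np.median, there is nothing to delegate to: numpy arrays do not have a .median attribute, so numpy runs its normal machinery, which is to flatten the array (i.e., the default axis=None). The median is then taken over every value in the group, and that single number ends up in both columns.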
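The flattening is easy to check on the raw values of the question's DataFrame. A small sketch, where the helper name `group_values` is mine and `.to_numpy()` is used so that only plain numpy is involved:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})

def group_values(key):
    """Return columns A and B of one group as a plain 2D numpy array."""
    return test.loc[test['C'] == key, ['A', 'B']].to_numpy()

for key in (1, 2):
    arr = group_values(key)
    flat = np.median(arr)             # axis=None: median over all 8 values
    per_col = np.median(arr, axis=0)  # axis=0: one median per column
    print(key, flat, per_col)
```

The flattened medians are 7.0 and 11.5, the same numbers as in Out[42], while axis=0 gives the column-wise values one would expect.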

The workaround would be to use test_g.aggregate([np.median, np.median]) to force it to always take the second path. What would work too is test_g.aggregate(np.median, axis=0), which passes axis=0 on to np.median and thus tells numpy how to handle the 2D input correctly. In general, I wonder whether pandas should at least issue a warning; after all, broadcasting one result to both columns is almost never what is wanted.
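For comparison, here is a sketch of the routes that never hand a whole 2D group to numpy: the groupby.median method mentioned in the question, the string alias 'median', and a dict keyed by column name as in the documentation quote. It deliberately does not call aggregate(np.median) itself, since what that does may differ between pandas versions.

```python
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})
test_g = test.groupby('C')

# Built-in method: column-wise median within each group.
by_method = test_g.median()

# String alias: resolved by pandas to the same groupby median.
by_name = test_g.agg('median')

# Dict keyed by column name: each column is aggregated on its own.
by_dict = test_g.agg({'A': 'median', 'B': 'median'})

print(by_method)
```

All three give 11.5 and 34.0 for A and 2.5 and 6.5 for B, so there is no flattening across columns.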

seberg answered Oct 22 '22 12:10