Inconsistency in results of aggregating pandas groupby object using numpy.median vs other functions

Using the following DataFrame (with pandas imported as pd and numpy as np):

import numpy as np
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})


In [39]: test
Out[39]: 
    A  B  C
0  10  1  1
1  11  2  1
2  12  3  1
3  13  4  1
4  15  5  2
5  25  6  2
6  43  7  2
7  70  8  2

Grouping the DataFrame by 'C' and aggregating with np.mean (or sum, min, max) produces a column-wise aggregation within each group:

In [40]: test_g = test.groupby('C')

In [41]: test_g.aggregate(np.mean)
Out[41]: 
       A    B
C            
1  11.50  2.5
2  38.25  6.5

However, aggregating with np.median appears to aggregate over the whole DataFrame within each group, so both columns get the same value:

In [42]: test_g.aggregate(np.median)
Out[42]: 
      A     B
C            
1   7.0   7.0
2  11.5  11.5

(Using the groupby.median method does seem to produce the expected column-wise results, though.)

I would appreciate answers to the following questions:

What is the reason/mechanism behind this outcome?
If this behaviour is confirmed, how does it affect the recommended "best practices" for aggregating groups? Could other aggregation functions behave the same way?

asked Sep 29 '12 09:09

LukaszJ

1 Answer

The reason is quite funny. Probably some pandas specialists would want to chime in, but it comes down to a ping-pong between numpy and pandas. Note that the documentation says:

Function to use for aggregating groups. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If passed a dict, the keys must be DataFrame column names

The first option means a 2D array-like (the whole DataFrame for the group) is passed to your function; the second means 1D array-likes (one column at a time) are passed to it.

This means aggregate first passes the whole 2D group in. In the first case (np.mean), numpy sees that the object has a .mean attribute, so it does what it always does and calls it. However, it calls it with axis=None (numpy's default). This makes pandas throw an exception (it wants axis to be 0 or 1, never None), so aggregate moves on to the second step, which passes the columns in one at a time as 1D and is foolproof.
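The two-step fallback described above can be sketched without pandas at all. This is only an illustration under stated assumptions: StrictFrame and aggregate_sketch are hypothetical names invented here, not pandas classes or functions, and the real aggregate code path differs in detail. What the sketch does rely on is that np.mean, when given a non-ndarray object with a .mean attribute, calls that attribute and forwards axis=None.

```python
import numpy as np


class StrictFrame:
    """Hypothetical stand-in for a 2D frame that rejects axis=None."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)
        self.seen_axis = []  # records every axis value np.mean forwards

    def mean(self, axis=0, dtype=None, out=None):
        self.seen_axis.append(axis)
        if axis not in (0, 1):
            raise ValueError(f"axis must be 0 or 1, got {axis!r}")
        return self.data.mean(axis=axis)


def aggregate_sketch(frame, func):
    """Hypothetical two-step aggregation: whole frame first, then per column."""
    try:
        return func(frame)  # step 1: pass the whole 2D object
    except ValueError:
        # step 2: pass each column as a plain 1D array
        return np.array([func(col) for col in frame.data.T])


# Group C == 1 from the question, columns A and B.
frame = StrictFrame([[10, 1], [11, 2], [12, 3], [13, 4]])
result = aggregate_sketch(frame, np.mean)

print(frame.seen_axis)  # np.mean forwarded axis=None to StrictFrame.mean
print(result)           # per-column means from the fallback step
```

The recorded axis shows that step 1 failed only because of the forwarded axis=None, which is what pushes np.mean onto the per-column path.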

However, when you pass in np.median, there is no .median attribute for numpy to call (numpy arrays do not have one), so it goes through the normal numpy machinery, which flattens the array (i.e., it uses the default axis=None).
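The flattening is easy to reproduce with a plain numpy array holding the first group (C == 1) from the question. This is a small sketch, with group1 as an illustrative variable name; it only shows numpy's own behaviour, not the pandas internals.

```python
import numpy as np

# Rows of group C == 1, columns A and B.
group1 = np.array([[10, 1],
                   [11, 2],
                   [12, 3],
                   [13, 4]])

# ndarrays have a .mean method but no .median method.
print(hasattr(group1, "mean"), hasattr(group1, "median"))

# Default axis=None: the array is flattened, giving one number for the group.
flat = np.median(group1)

# axis=0: one median per column.
per_column = np.median(group1, axis=0)

print(flat)        # 7.0, the value that shows up in both columns in Out[42]
print(per_column)  # medians of A and B separately: 11.5 and 2.5
```

The flattened value 7.0 is the midpoint of 4 and 10 in the sorted, combined values 1, 2, 3, 4, 10, 11, 12, 13, which matches the first row of Out[42].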

The workaround would be to use test_g.aggregate([np.median, np.median]) to force it to always take the second path. What would also work is test_g.aggregate(np.median, axis=0), which passes axis=0 on to np.median and thus tells numpy how to handle the input correctly. In general, I wonder whether pandas should at least throw a warning; after all, broadcasting the result to both columns should almost never be what is wanted.
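For comparison, here is a sketch of the built-in groupby median that the question mentions as giving column-wise results. It avoids passing np.median through aggregate entirely, so it does not depend on which of the two paths is taken; by_method and by_name are illustrative variable names. How the workarounds above behave may depend on the pandas version, so they are not exercised here.

```python
import pandas as pd

test = pd.DataFrame({'A': [10, 11, 12, 13, 15, 25, 43, 70],
                     'B': [1, 2, 3, 4, 5, 6, 7, 8],
                     'C': [1, 1, 1, 1, 2, 2, 2, 2]})

test_g = test.groupby('C')

# GroupBy.median computes one median per column within each group.
by_method = test_g.median()

# The string name 'median' dispatches to the same built-in aggregation.
by_name = test_g.agg('median')

print(by_method)
```

For this data the per-column medians are 11.5 and 34.0 for A, and 2.5 and 6.5 for B, rather than the flattened 7.0 and 11.5.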

answered Oct 22 '22 12:10

seberg

Tags: python, pandas, aggregate, numpy