Python: Pandas wrongly excluding column in groupby

Tags:

python

pandas

I have come across pandas' silent exclusion of nuisance columns, as explained here: Pandas Nuisance columns

The documentation says that pandas silently excludes a column if the aggregate function cannot be applied to it.

Consider the following example:

I have a data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'C': {0: -0.91985400000000006, 1: -0.042379, 2: 1.2476419999999999, 3: -0.00992, 4: 0.290213, 5: 0.49576700000000001, 6: 0.36294899999999997, 7: 1.548106}, 'A': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'bar', 4: 'foo', 5: 'bar', 6: 'foo', 7: 'foo'}, 'B': {0: -1.131345, 1: -0.089328999999999992, 2: 0.33786300000000002, 3: -0.94586700000000001, 4: -0.93213199999999996, 5: 1.9560299999999999, 6: 0.017587000000000002, 7: -0.016691999999999999}})

df:
     A         B         C
0  foo -1.131345 -0.919854
1  bar -0.089329 -0.042379
2  foo  0.337863  1.247642
3  bar -0.945867 -0.009920
4  foo -0.932132  0.290213
5  bar  1.956030  0.495767
6  foo  0.017587  0.362949
7  foo -0.016692  1.548106

Let me combine columns B and C into a single column D whose values are NumPy arrays:

df = df.assign(D=df[['B', 'C']].values.tolist())
df['D'] = df['D'].apply(np.array)

df:

     A         B         C  D
0  foo -1.131345 -0.919854  [-1.131345, -0.9198540000000001]
1  bar -0.089329 -0.042379  [-0.08932899999999999, -0.042379]
2  foo  0.337863  1.247642  [0.337863, 1.247642]
3  bar -0.945867 -0.009920  [-0.945867, -0.00992]
4  foo -0.932132  0.290213  [-0.932132, 0.290213]
5  bar  1.956030  0.495767  [1.95603, 0.495767]
6  foo  0.017587  0.362949  [0.017587000000000002, 0.36294899999999997]
7  foo -0.016692  1.548106  [-0.016692, 1.548106]

Now I can apply mean to column D:

print(df['D'].mean())
print(df['B'].mean())
print(df['C'].mean())

[-0.10048563  0.3715655 ]
-0.100485625
0.3715655

But when I group by A and take the mean, column D is dropped:

df.groupby('A').mean()

            B         C
A
bar  0.306945  0.147823
foo -0.344944  0.505811

My question is: why is column D excluded even though the aggregate function can be applied to it successfully?

Also, in general, how do I use aggregate functions like mean or sum when the column of interest holds NumPy arrays?


asked Feb 22 '18 08:02

Vikash Balasubramanian

1 Answer

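It is possible, but you need an if-else in a custom function:

def f(x):
    # Scalar columns (B, C) give a float; the array column (D) gives an ndarray.
    a = x.mean()
    # Return scalars unchanged and convert an array result to a plain list.
    return a if isinstance(a, (float, int)) else list(a)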

out = df.groupby('A').agg(f)
print(out)
            B         C                                 D
A
bar  0.306945  0.147823  [0.306944666667, 0.147822666667]
foo -0.344944  0.505811           [-0.3449438, 0.5058112]

For comparison, without the conversion to a list, column D comes out as NaN:

out = df.groupby('A').agg(lambda x: x.mean())
print(out)
            B         C   D
A                          
bar  0.306945  0.147823 NaN
foo -0.344944  0.505811 NaN
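
As a possible alternative, here is a minimal sketch that avoids aggregating the object-dtype column altogether. Column D has object dtype, which is presumably why groupby treats it as a nuisance column; newer pandas versions may raise an error here instead of dropping it silently. The idea is to expand the arrays into ordinary numeric columns, group those, and pack the result back into arrays if needed. The names expanded, group_means and mean_arrays are made up for this illustration, and the sketch assumes every array in D has the same length.

```python
import numpy as np
import pandas as pd

# Same data as in the question, rounded to the displayed precision.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': [-1.131345, -0.089329, 0.337863, -0.945867,
          -0.932132, 1.956030, 0.017587, -0.016692],
    'C': [-0.919854, -0.042379, 1.247642, -0.009920,
          0.290213, 0.495767, 0.362949, 1.548106],
})

# Build the array column the same way as in the question.
df = df.assign(D=df[['B', 'C']].values.tolist())
df['D'] = df['D'].apply(np.array)

# Expand D into one numeric column per array position (labels 0 and 1).
# Assumption: all arrays in D have the same length.
expanded = pd.DataFrame(df['D'].tolist(), index=df.index)

# Group the numeric columns by the original key column and average them.
group_means = expanded.groupby(df['A']).mean()

# Optionally pack each group's row back into a single NumPy array.
mean_arrays = {key: row.to_numpy() for key, row in group_means.iterrows()}
print(mean_arrays)
```

Because the expanded columns are plain floats, the same pattern should work with sum or other built-in aggregations in place of mean.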

answered Sep 27 '22 17:09

jezrael
