I came across pandas' silent exclusion of nuisance columns, as explained here: Pandas Nuisance columns
The documentation says that a column is silently excluded if the aggregate function cannot be applied to it.
Consider the following example:
I have a data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'C': {0: -0.91985400000000006, 1: -0.042379, 2: 1.2476419999999999, 3: -0.00992, 4: 0.290213, 5: 0.49576700000000001, 6: 0.36294899999999997, 7: 1.548106}, 'A': {0: 'foo', 1: 'bar', 2: 'foo', 3: 'bar', 4: 'foo', 5: 'bar', 6: 'foo', 7: 'foo'}, 'B': {0: -1.131345, 1: -0.089328999999999992, 2: 0.33786300000000002, 3: -0.94586700000000001, 4: -0.93213199999999996, 5: 1.9560299999999999, 6: 0.017587000000000002, 7: -0.016691999999999999}})
df:
A B C
0 foo -1.131345 -0.919854
1 bar -0.089329 -0.042379
2 foo 0.337863 1.247642
3 bar -0.945867 -0.009920
4 foo -0.932132 0.290213
5 bar 1.956030 0.495767
6 foo 0.017587 0.362949
7 foo -0.016692 1.548106
Let me combine the two columns B and C into a new column D, where each entry is a NumPy ndarray:
df = df.assign(D=df[['B', 'C']].values.tolist())
df['D'] = df['D'].apply(np.array)
df:
A B C D
0 foo -1.131345 -0.919854 [-1.131345, -0.9198540000000001]
1 bar -0.089329 -0.042379 [-0.08932899999999999, -0.042379]
2 foo 0.337863 1.247642 [0.337863, 1.247642]
3 bar -0.945867 -0.009920 [-0.945867, -0.00992]
4 foo -0.932132 0.290213 [-0.932132, 0.290213]
5 bar 1.956030 0.495767 [1.95603, 0.495767]
6 foo 0.017587 0.362949 [0.017587000000000002, 0.36294899999999997]
7 foo -0.016692 1.548106 [-0.016692, 1.548106]
Now I can apply mean to column D:
print(df['D'].mean())
print(df['B'].mean())
print(df['C'].mean())
[-0.10048563 0.3715655 ]
-0.100485625
0.3715655
But when I try to group by A and take the mean, column D gets dropped:
df.groupby('A').mean()
B C
A
bar 0.306945 0.147823
foo -0.344944 0.505811
My question is: why is column D excluded even though the aggregate function can be applied to it successfully?
Also, in general, how do I use aggregate functions like mean or sum when the column of interest holds NumPy arrays?
From the docs: "NA groups in GroupBy are automatically excluded".
It is possible, but you need an if-else in a custom function:
def f(x):
    a = x.mean()
    return a if isinstance(a, (float, int)) else list(a)
df1 = df.groupby('A').agg(f)
print (df1)
B C D
A
bar 0.306945 0.147823 [0.306944666667, 0.147822666667]
foo -0.344944 0.505811 [-0.3449438, 0.5058112]
Without the type check, a plain lambda returns NaN for column D:
df2 = df.groupby('A').agg(lambda x: x.mean())
print (df2)
B C D
A
bar 0.306945 0.147823 NaN
foo -0.344944 0.505811 NaN
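As a possible alternative for the general case of aggregating a column of NumPy arrays, the sketch below first checks the dtype of D (it is object, which is presumably why the built-in groupby mean treats it as a nuisance column) and then stacks each group's arrays into a 2-D array before averaging along axis 0. The small data frame and the names group_means, key and group are made up for illustration; this is one way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, same layout as in the question).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [10.0, 20.0, 30.0, 40.0],
})

# Column D holds one NumPy array per row, so pandas stores it as object dtype.
df["D"] = [np.array(row) for row in df[["B", "C"]].to_numpy()]
print(df["D"].dtype)

# Iterate over the groups, stack each group's arrays into shape (n_rows, 2),
# and take the element-wise mean over the rows.
group_means = {
    key: np.stack(group["D"].tolist()).mean(axis=0)
    for key, group in df.groupby("A")
}

for key, value in group_means.items():
    print(key, value)
```

Swapping .mean(axis=0) for .sum(axis=0) gives the element-wise sum per group in the same way.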