I'm curious about the behavior of pandas groupby-apply when the apply function returns a series.
When the series are of different lengths, it returns a multi-indexed series.
In [1]: import pandas as pd
In [2]: df1=pd.DataFrame({'state':list("AABBB"),
...: 'city':list("vwxyz")})
In [3]: df1
Out[3]:
city state
0 v A
1 w A
2 x B
3 y B
4 z B
In [4]: def f(x):
...: return pd.Series(x['city'].values,index=range(len(x)))
...:
In [5]: df1.groupby('state').apply(f)
Out[5]:
state
A 0 v
1 w
B 0 x
1 y
2 z
dtype: object
This returns a Series object.
However, if every series has the same length, then it pivots this into a DataFrame.
In [6]: df2=pd.DataFrame({'state':list("AAABBB"),
...: 'city':list("uvwxyz")})
In [7]: df2
Out[7]:
city state
0 u A
1 v A
2 w A
3 x B
4 y B
5 z B
In [8]: df2.groupby('state').apply(f)
Out[8]:
0 1 2
state
A u v w
B x y z
Is this really the intended behavior? Are we meant to check the return type if we use apply this way? Or is there an option in apply that I'm not appreciating?
In case you're curious, in my actual use case, the returned Series will be the same length as the group. It seems like an ideal case for transform, except that I've found that apply returning a Series is actually an order of magnitude faster on a large dataset. That can be another topic.
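For comparison, here is a minimal sketch of the transform route mentioned above. transform returns an object indexed like its input, so the result type does not depend on the group sizes. The upper-casing function is only a placeholder I made up for illustration; it is not the real per-group function.

```python
import pandas as pd

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})
df2 = pd.DataFrame({'state': list("AAABBB"), 'city': list("uvwxyz")})

# Hypothetical per-group function: returns one value per row of the group.
def upper(s):
    return s.str.upper()

# transform aligns the per-group results back onto the original index.
t1 = df1.groupby('state')['city'].transform(upper)
t2 = df2.groupby('state')['city'].transform(upper)

print(type(t1).__name__, type(t2).__name__)
print(t1.index.equals(df1.index), t2.index.equals(df2.index))
```

Both results are Series with the same index as the input frame, whether the groups are equal-sized or not.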
Edit: Loosely based on Parfait's answer, we can certainly do this:
X=df.groupby('state').apply(f)
if not isinstance(X,pd.Series):
X=X.stack()
X
That will give the same output type for either df=df1 or df=df2. I guess I'm just asking if this is really the normal or preferred way to handle this.
In essence, a dataframe consists of equal-length series (technically a dictionary-like container of Series objects). As stated in the pandas split-apply-combine docs, running a groupby() refers to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Notice this does not state that a data frame is always produced, only a generalized data structure. So a groupby() operation on a data frame can come back as a Series, or, if given a Series as input, can come back as a data frame.
For your first dataframe, the groups have unequal sizes, so the series returned by your function have unequal index lengths, and the "combine" step cannot line them up as rows of a data frame. Because the returned series do not share the same index, pandas concatenates them into a multi-index series instead. You can see this with print statements in the defined function: the state==A group has length 2 and the B group has length 3.
def f(x):
print(x)
return pd.Series(x['city'].values, index=range(len(x)))
s1 = df1.groupby('state').apply(f)
print(s1)
# city state
# 0 v A
# 1 w A
# city state
# 0 v A
# 1 w A
# city state
# 2 x B
# 3 y B
# 4 z B
# state
# A 0 v
# 1 w
# B 0 x
# 1 y
# 2 z
# dtype: object
However, you can manipulate the multi-index series outcome by resetting index and thereby adjusting its hierarchical levels:
df = df1.groupby('state').apply(f).reset_index()
print(df)
# state level_1 0
# 0 A 0 v
# 1 A 1 w
# 2 B 0 x
# 3 B 1 y
# 4 B 2 z
But more relevant to your needs is unstack(), which pivots a level of the index labels, yielding a data frame. Consider fillna() to fill the None outcome.
df = df1.groupby('state').apply(f).unstack()
print(df)
# 0 1 2
# state
# A v w None
# B x y z
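As a short sketch of combining unstack() with fillna(): the example below assumes an empty string is an acceptable filler for the missing cell. That choice is mine for illustration; pick whatever placeholder suits your data.

```python
import pandas as pd

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})

def f(x):
    return pd.Series(x['city'].values, index=range(len(x)))

# Unequal group sizes give a multi-index series; unstack() pivots the
# inner level into columns, and fillna('') replaces the missing cell
# (the '' filler is an assumption made for this example).
wide = df1.groupby('state').apply(f).unstack().fillna('')
print(wide)
```

Here the short group (A) is padded in column 2, so the frame has one row per state and one column per position within the group.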
Instead of using index=range(len(x)) in your function f, you can use index=x.index to prevent this undesired behavior.
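A minimal sketch of that suggestion, run against both frames from the question. The name g is just a label I picked for the modified function. As far as I can tell, this works because each group keeps its own distinct row labels, so the returned series never share an index and pandas concatenates them rather than pivoting. That relies on there being at least two groups.

```python
import pandas as pd

df1 = pd.DataFrame({'state': list("AABBB"), 'city': list("vwxyz")})
df2 = pd.DataFrame({'state': list("AAABBB"), 'city': list("uvwxyz")})

# Same as f in the question, but reusing the group's own index labels.
def g(x):
    return pd.Series(x['city'].values, index=x.index)

r1 = df1.groupby('state').apply(g)  # unequal group sizes
r2 = df2.groupby('state').apply(g)  # equal group sizes

print(type(r1).__name__, type(r2).__name__)
```

Both calls give a Series. Depending on the pandas version and the group_keys setting, the state key may or may not appear as an outer index level, so check the index layout before relying on it.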