
pandas groupby-apply behavior, returning a Series (inconsistent output type)

Tags:

python

pandas

I'm curious about the behavior of pandas groupby-apply when the apply function returns a series.

When the series are of different lengths, it returns a multi-indexed series.

In [1]: import pandas as pd

In [2]: df1=pd.DataFrame({'state':list("AABBB"),
   ...:                 'city':list("vwxyz")})

In [3]: df1
Out[3]:
  city state
0    v     A
1    w     A
2    x     B
3    y     B
4    z     B

In [4]: def f(x):
   ...:         return pd.Series(x['city'].values,index=range(len(x)))
   ...:

In [5]: df1.groupby('state').apply(f)
Out[5]:
state
A      0    v
       1    w
B      0    x
       1    y
       2    z
dtype: object

This returns a Series object.

However, if every series has the same length, then it pivots this into a DataFrame.

In [6]: df2=pd.DataFrame({'state':list("AAABBB"),
   ...:                 'city':list("uvwxyz")})

In [7]: df2
Out[7]:
  city state
0    u     A
1    v     A
2    w     A
3    x     B
4    y     B
5    z     B

In [8]: df2.groupby('state').apply(f)
Out[8]:
       0  1  2
state
A      u  v  w
B      x  y  z

Is this really the intended behavior? Are we meant to check the return type if we use apply this way? Or is there an option in apply that I'm not appreciating?

In case you're curious, in my actual use case the returned Series will be the same length as the group. It seems like an ideal case for transform, except that I've found that apply returning a Series is actually an order of magnitude faster on a large dataset. That can be another topic.

Edit: Loosely based on Parfait's answer, we can certainly do this:

X=df.groupby('state').apply(f)
if not isinstance(X,pd.Series):
    X=X.stack()
X

That will give the same output type for either df=df1 or df=df2. I guess I'm just asking if this is really the normal or preferred way to handle this.

Victor Chubukov asked Jun 09 '16 00:06


2 Answers

In essence, a DataFrame consists of equal-length Series (technically, a dict-like container of Series objects). As stated in the pandas split-apply-combine docs, a groupby() operation involves one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Notice that this does not say a DataFrame is always produced, only some data structure. So a groupby() operation on a DataFrame can return a Series and, given a Series as input, can return a DataFrame.

For your first DataFrame, the groups have unequal lengths, so the Series returned by f have different indexes and the "combine" step cannot line them up as the rows of a DataFrame. Pandas instead concatenates them into a multi-indexed Series. You can see this by adding a print statement to the function: the state == A group has length 2 and the B group has length 3. (Group A is printed twice in the output because older pandas versions call the function twice on the first group.)

def f(x):
    print(x)
    return pd.Series(x['city'].values, index=range(len(x)))

s1 = df1.groupby('state').apply(f)

print(s1)
#   city state
# 0    v     A
# 1    w     A
#   city state
# 0    v     A
# 1    w     A
#   city state
# 2    x     B
# 3    y     B
# 4    z     B
# state   
# A      0    v
#        1    w
# B      0    x
#        1    y
#        2    z
# dtype: object

However, you can manipulate the multi-index series outcome by resetting index and thereby adjusting its hierarchical levels:

df = df1.groupby('state').apply(f).reset_index()
print(df)

#   state  level_1  0
# 0     A        0  v
# 1     A        1  w
# 2     B        0  x
# 3     B        1  y
# 4     B        2  z

But more relevant to your needs is unstack(), which pivots a level of the index labels into columns, yielding a DataFrame. Consider fillna() to replace the missing values that appear for the shorter groups.

df = df1.groupby('state').apply(f).unstack()
print(df)

#        0  1     2
# state            
# A      v  w  None
# B      x  y     z

Parfait answered Sep 19 '22 14:09


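Instead of using index=range(len(x)) in your function f, you can use index=x.index. The groups then return Series with different index labels, so apply gives a multi-indexed Series whether or not the groups have equal lengths.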

user3582076 answered Sep 20 '22 14:09