
Unclear why groupby with single group produces row DataFrame

Tags:

python

pandas

Here are two groupby operations on a pandas.DataFrame:

import pandas


d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})

grp1 = pandas.Series([1, 1, 1, 1, 1, 1])
ans1 = d.groupby(grp1).apply(lambda x: x.a * x.b.iloc[0])

grp2 = pandas.Series([1, 1, 1, 2, 2, 2])
ans2 = d.groupby(grp2).apply(lambda x: x.a * x.b.iloc[0])

print(ans1.reset_index(drop=True))
# a  0  1  2  3  4  5
# 0  1  2  3  4  5  6

print(ans2.reset_index(drop=True))
# 0     1
# 1     2
# 2     3
# 3    12
# 4    15
# 5    18
# Name: a, dtype: int64

I want the output in the format of ans2. If the grouping Series has more than one group (as in grp2), there is no issue with the output format. However, when the grouping Series has only one group (as in grp1), the output is a DataFrame with a single row. Why is this?

How can I ensure that the output will always be like ans2 regardless of the number of groups in the grouping Series? Is there a quicker/better approach than either of the following?

  1. Checking if the output is a DataFrame and coercing into a Series
  2. Checking if the grouping Series has only one group and avoiding groupby if that's the case
d.b asked Sep 08 '21 21:09



2 Answers

A simple solution is to return a DataFrame from apply:

import pandas


d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})


grp1 = pandas.Series([1, 1, 1, 1, 1, 1])


ans1 = d.groupby(grp1).apply(lambda x: x[['a']] * x.b.iloc[0])

grp2 = pandas.Series([1, 1, 1, 2, 2, 2])
ans2 = d.groupby(grp2).apply(lambda x: x[['a']] * x.b.iloc[0])

print(ans1.reset_index(drop=True))
#    a
# 0  1
# 1  2
# 2  3
# 3  4
# 4  5
# 5  6

print(ans2.reset_index(drop=True))
#     a
# 0   1
# 1   2
# 2   3
# 3  12
# 4  15
# 5  18

To understand why, the documentation of the apply function is helpful. When the function given to apply returns a Series, each result is converted to a row and the final output is a DataFrame with one row per group. So the behaviour for grp1 is actually expected.

This raises the question of why the second case, using grp2, returns a Series. I think that is because the two groups return Series with different index values, so the results of the two groups are concatenated into a single Series with a multi-level index (as seen below).

d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})

grp2 = pandas.Series([1, 1, 1, 2, 2, 2])
def func(x):
    z = x.a * x.b.iloc[0]
    print(z.index)
    return z
ans2 = d.groupby(grp2).apply(func)
# Int64Index([0, 1, 2], dtype='int64')
# Int64Index([3, 4, 5], dtype='int64')

print(ans2)
# 1  0     1
#    1     2
#    2     3
# 2  3    12
#    4    15
#    5    18
# Name: a, dtype: int64
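
The index hypothesis above can be probed without going through apply at all. The following is a minimal sketch, not a description of pandas internals: it runs the same computation on each group by hand and checks whether every per-group result carries an identical index. The helper name `per_group_indexes_match` is hypothetical.

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                  "b": [1, 2, 4, 3, -1, 5]})


def per_group_indexes_match(frame, keys):
    """Return True if every group's result Series has the same index.

    Hypothetical helper for illustration; it only mirrors the condition
    discussed in the answer and is not part of pandas.
    """
    # Compute the per-group result manually instead of calling apply.
    results = [g.a * g.b.iloc[0] for _, g in frame.groupby(keys)]
    first_index = results[0].index
    return all(r.index.equals(first_index) for r in results)


same_grp1 = per_group_indexes_match(d, pd.Series([1, 1, 1, 1, 1, 1]))
same_grp2 = per_group_indexes_match(d, pd.Series([1, 1, 1, 2, 2, 2]))
print(same_grp1, same_grp2)
# True False
```

With a single group there is only one result, so the indexes trivially match, which is consistent with the one-row DataFrame seen for grp1.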
tihom answered Sep 26 '22 02:09


I think the easiest is to avoid .apply(), which indeed does weird things when recombining. This is probably because the semantics of this function are so vague: you can return anything, and pandas will do its best to guess what you meant.

If you want consistent results with functions that apply to the whole sub-dataframe you’re better off running the function yourself:

>>> import pandas as pd
>>> pd.concat({n: (lambda x: x.a * x.b.iloc[0])(g) for n, g in d.groupby(grp1)})
1  0    1
   1    2
   2    3
   3    4
   4    5
   5    6
Name: a, dtype: int64
>>> pd.concat({n: (lambda x: x.a * x.b.iloc[0])(g) for n, g in d.groupby(grp2)})
1  0     1
   1     2
   2     3
2  3    12
   4    15
   5    18
Name: a, dtype: int64
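
The manual loop above can be wrapped in a small function so the call site stays short. This is only a sketch under the same assumptions as the session above; `apply_per_group` and `scale` are hypothetical names, not pandas API.

```python
import pandas as pd


def apply_per_group(frame, keys, func):
    """Call func on each group and concatenate the pieces.

    Hypothetical helper: assumes func returns a Series for every group.
    """
    pieces = {name: func(group) for name, group in frame.groupby(keys)}
    # Concatenating a dict of Series yields one Series whose outer index
    # level holds the group keys, for one group or for many.
    return pd.concat(pieces)


d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                  "b": [1, 2, 4, 3, -1, 5]})


def scale(x):
    # Same computation as the lambda in the question.
    return x.a * x.b.iloc[0]


one_group = apply_per_group(d, pd.Series([1, 1, 1, 1, 1, 1]), scale)
two_groups = apply_per_group(d, pd.Series([1, 1, 1, 2, 2, 2]), scale)

# Dropping the index gives the flat format the question asks for.
flat_one = one_group.reset_index(drop=True)
flat_two = two_groups.reset_index(drop=True)
```

Because the recombination step is written out explicitly, the result shape does not depend on how many groups the grouping Series contains.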

Now what I would recommend instead is to use a function with a well-defined return shape. Here .transform() could be of use:

>>> d.groupby(grp1)['b'].transform('first')
0    1
1    1
2    1
3    1
4    1
5    1
Name: b, dtype: int64
>>> d.groupby(grp2)['b'].transform('first')
0    1
1    1
2    1
3    3
4    3
5    3
Name: b, dtype: int64

Here’s an example of how you could use it for the same calculation:

>>> ans1 = d.copy()
>>> ans1['a'] *= d.groupby(grp1)['b'].transform('first')
>>> ans1
   a  b
0  1  1
1  2  2
2  3  4
3  4  3
4  5 -1
5  6  5
>>> ans2 = d.copy()
>>> ans2['a'] *= d.groupby(grp2)['b'].transform('first')
>>> ans2
    a  b
0   1  1
1   2  2
2   3  4
3  12  3
4  15 -1
5  18  5
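
If only the resulting Series is needed rather than a modified copy of the frame, the same idea can be written as a single expression. A minimal sketch follows; the function name `scale_by_first_b` is hypothetical, and the `.rename("a")` call is an assumption made here to match the `Name: a` shown in the question's ans2.

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                  "b": [1, 2, 4, 3, -1, 5]})


def scale_by_first_b(frame, keys):
    """Multiply column a by the first b of each row's group."""
    # transform returns a Series aligned with the original rows,
    # so the product has one value per input row.
    first_b = frame.groupby(keys)["b"].transform("first")
    return (frame["a"] * first_b).rename("a")


res1 = scale_by_first_b(d, pd.Series([1, 1, 1, 1, 1, 1]))
res2 = scale_by_first_b(d, pd.Series([1, 1, 1, 2, 2, 2]))
print(res2.tolist())
# [1, 2, 3, 12, 15, 18]
```

One caveat worth checking against your own data: the `'first'` aggregation skips missing values by default, whereas `x.b.iloc[0]` takes the first row whatever it contains, so the two can differ when `b` starts with NaN in some group.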
Cimbali answered Sep 24 '22 02:09