Here are two groupby operations on a pandas.DataFrame:
import pandas
d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})
grp1 = pandas.Series([1, 1, 1, 1, 1, 1])
ans1 = d.groupby(grp1).apply(lambda x: x.a * x.b.iloc[0])
grp2 = pandas.Series([1, 1, 1, 2, 2, 2])
ans2 = d.groupby(grp2).apply(lambda x: x.a * x.b.iloc[0])
print(ans1.reset_index(drop=True))
# a  0  1  2  3  4  5
# 0  1  2  3  4  5  6
print(ans2.reset_index(drop=True))
# 0 1
# 1 2
# 2 3
# 3 12
# 4 15
# 5 18
# Name: a, dtype: int64
I want the output in the format of ans2. If the grouping Series has more than one group (as in grp2), there is no issue with the output format. However, when the grouping Series has only one group (as in grp1), the output is a DataFrame with a single row. Why is this? How can I ensure that the output will always be like ans2, regardless of the number of groups in the grouping Series? Is there a quicker/better approach than groupby
if that's the case?

So a groupby() operation can downcast to a Series or, if given a Series as input, upcast to a DataFrame. For your first DataFrame, you run unequal groupings (or unequal index lengths), coercing a Series return which the "combine" step does not adequately yield back as a DataFrame. The pandas documentation describes groupby() like this:

"Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."

Notice this does not state that a DataFrame is always produced, only a generalized data structure combined from the per-group results.
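This up/down-casting is easy to demonstrate. A minimal sketch (the variable names here are illustrative, not from the question):

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
g = d.groupby(pd.Series([1, 1, 1, 2, 2, 2]))

# A scalar per group combines into a Series indexed by group key...
scalars = g.apply(lambda x: x.a.sum())
print(type(scalars).__name__)  # Series

# ...while a DataFrame per group combines into a DataFrame.
frames = g.apply(lambda x: x[["a"]])
print(type(frames).__name__)  # DataFrame
```

The result type depends entirely on what the applied function returns, which is the heart of the inconsistency in the question.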
A simple solution is to return a DataFrame from apply:
import pandas
d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})
grp1 = pandas.Series([1, 1, 1, 1, 1, 1])
ans1 = d.groupby(grp1).apply(lambda x: x[['a']] * x.b.iloc[0])
grp2 = pandas.Series([1, 1, 1, 2, 2, 2])
ans2 = d.groupby(grp2).apply(lambda x: x[['a']] * x.b.iloc[0])
print(ans1.reset_index(drop=True))
# a
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
# 5 6
print(ans2.reset_index(drop=True))
# a
# 0 1
# 1 2
# 2 3
# 3 12
# 4 15
# 5 18
To understand why, the documentation of the apply function is helpful. When the function given to apply returns a Series for every group, each Series is converted to a row, and the final output is a DataFrame with one row per group. So the behaviour with grp1 is actually expected.
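One way to see this row-per-group conversion in isolation: if every group returns a Series with the same index labels, pandas lines them up as rows of a DataFrame. A minimal sketch (not from the question):

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6]})
grp = pd.Series([1, 1, 1, 2, 2, 2])

# Each group returns a Series with the SAME index labels (0, 1, 2),
# so pandas stacks them as rows: index = group keys, columns = 0..2.
out = d.groupby(grp).apply(lambda x: x.a.reset_index(drop=True))
print(out.shape)  # (2, 3)
```

With grp1 there is only one group, so the result is that same layout with a single row, which is exactly the DataFrame the question observed.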
This raises the question of why the second case, using grp2, returns a Series. I think that is because the two groups return Series with different index values, so the results of the two groups are concatenated into a single Series with a multi-level index (as seen below).
import pandas

d = pandas.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                      "b": [1, 2, 4, 3, -1, 5]})
grp2 = pandas.Series([1, 1, 1, 2, 2, 2])

def func(x):
    z = x.a * x.b.iloc[0]
    print(z.index)
    return z
ans2 = d.groupby(grp2).apply(func)
# Int64Index([0, 1, 2], dtype='int64')
# Int64Index([3, 4, 5], dtype='int64')
print(ans2)
# 1 0 1
# 1 2
# 2 3
# 2 3 12
# 4 15
# 5 18
# Name: a, dtype: int64
I think the easiest is to avoid .apply(), which indeed does weird things when recombining results. This is probably because the semantics of this function are so vague: you can return anything, and pandas will do its best to guess what you meant.
If you want consistent results with functions that apply to the whole sub-DataFrame, you're better off running the function yourself:
>>> pd.concat({n: (lambda x: x.a * x.b.iloc[0])(g) for n, g in d.groupby(grp1)})
1 0 1
1 2
2 3
3 4
4 5
5 6
Name: a, dtype: int64
>>> pd.concat({n: (lambda x: x.a * x.b.iloc[0])(g) for n, g in d.groupby(grp2)})
1 0 1
1 2
2 3
2 3 12
4 15
5 18
Name: a, dtype: int64
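The dict-comprehension pattern above can be wrapped into a small reusable helper; apply_concat is just an illustrative name here, not a pandas function. The names argument of pd.concat labels the group level of the resulting MultiIndex:

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                  "b": [1, 2, 4, 3, -1, 5]})
grp2 = pd.Series([1, 1, 1, 2, 2, 2])

def apply_concat(df, grouper, func):
    # Concatenate per-group results under a MultiIndex whose outer
    # level is the group key, always yielding the same shape.
    return pd.concat({name: func(g) for name, g in df.groupby(grouper)},
                     names=["group"])

out = apply_concat(d, grp2, lambda x: x.a * x.b.iloc[0])
print(out)
```

Because the concat is explicit, the result shape no longer depends on how many groups the grouper happens to contain.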
Now what I would recommend instead is to use a function with a well-defined return shape. Here .transform() could be of use:
>>> d.groupby(grp1)['b'].transform('first')
0 1
1 1
2 1
3 1
4 1
5 1
Name: b, dtype: int64
>>> d.groupby(grp2)['b'].transform('first')
0 1
1 1
2 1
3 3
4 3
5 3
Name: b, dtype: int64
Here’s an example of how you could use it for the same calculation:
>>> ans1 = d.copy()
>>> ans1['a'] *= d.groupby(grp1)['b'].transform('first')
>>> ans1
a b
0 1 1
1 2 2
2 3 4
3 4 3
4 5 -1
5 6 5
>>> ans2 = d.copy()
>>> ans2['a'] *= d.groupby(grp2)['b'].transform('first')
>>> ans2
a b
0 1 1
1 2 2
2 3 4
3 12 3
4 15 -1
5 18 5
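If you prefer not to mutate a copy in place, the same calculation can be written as a single expression with assign, which returns a new DataFrame and leaves d untouched. A sketch using the same d and grp2 as above:

```python
import pandas as pd

d = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                  "b": [1, 2, 4, 3, -1, 5]})
grp2 = pd.Series([1, 1, 1, 2, 2, 2])

# assign builds a new frame; the original d is not modified.
ans2 = d.assign(a=d["a"] * d.groupby(grp2)["b"].transform("first"))
print(ans2["a"].tolist())  # [1, 2, 3, 12, 15, 18]
```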