I am trying to groupby-aggregate a DataFrame using lambda functions that are created programmatically, so that I can simulate a one-hot encoding of the categories present in a column.
Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'], [30, 'B']]),
                  columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result. Not sure what I am doing wrong. I know I could do this with get_dummies, but using lambdas is more convenient for automation. Also, I can ensure the order of the output columns.
Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
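Note that crosstab counts occurrences, so a category that appears twice for the same ID would show 2 rather than 1. A minimal sketch of one way to cap the counts at 1 and pin the column order, assuming a hypothetical one_hot_columns list (including a category "C" that never occurs) and a sample frame with a duplicate row added for illustration:

```python
import pandas as pd

# Illustrative data (assumption): same shape as the question, plus a duplicate (10, 'A') row.
df = pd.DataFrame({"ID": [10, 10, 10, 20, 30],
                   "category": ["A", "A", "B", "A", "B"]})

# Hypothetical list fixing the desired output columns and their order.
one_hot_columns = ["B", "A", "C"]

# crosstab yields per-ID counts; clip caps them at 1 to get 0/1 flags.
counts = pd.crosstab(df["ID"], df["category"])
one_hot = (
    counts.clip(upper=1)
          .reindex(columns=one_hot_columns, fill_value=0)  # enforce order, add missing categories as 0
          .reset_index()
)
```

The reindex step is what guarantees the column order, and it also adds an all-zero column for any listed category that is absent from the data.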
You can use pd.get_dummies with Groupby.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
Or use pd.concat with pd.get_dummies:
In [4329]: res = pd.concat([df[['ID']], pd.get_dummies(df['category'])], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
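As for the lambda approach in the question: the result shown there is consistent with Python's late-binding closures. Every lambda in the list comprehension looks up column when it is called, and by then the loop has finished, so all of them compare against 'B'. A minimal sketch of one way around this, assuming integer IDs for simplicity, is to bind the current value through a default argument (c=c) and to pass the lambdas as named aggregations so the output columns get the category names:

```python
import pandas as pd

# Sample data as in the question (assumption: IDs stored as integers).
df = pd.DataFrame({"ID": [10, 10, 20, 30],
                   "category": ["A", "B", "A", "B"]})

one_hot_columns = ["A", "B"]

# c=c freezes the current loop value inside each lambda,
# instead of looking the variable up later at call time.
lambdas = {c: (lambda x, c=c: int(x.eq(c).any())) for c in one_hot_columns}

# Named aggregation: each keyword becomes an output column, in the given order.
df_g = df.groupby("ID")["category"].agg(**lambdas).reset_index()
```

The dictionary keeps the order of one_hot_columns, so the output column order stays under your control.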