In pandas, how can I add a new column which enumerates rows based on a given grouping?
For instance, assume the following DataFrame:
import pandas as pd
import numpy as np
a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
col_a col_b
0 A 0
1 B 1
2 C 2
3 A 3
4 A 4
5 C 5
6 B 6
7 B 7
8 A 8
9 C 9
I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.
Desired output:
col_a col_b col_c
0 A 0 1
3 A 3 2
4 A 4 3
8 A 8 4
1 B 1 1
6 B 6 2
7 B 7 3
2 C 2 1
5 C 5 2
9 C 9 3
I'm struggling to get to col_c. You can get to the proper grouping and sorting with .sort_values(['col_a', 'col_b']) (written .sort_index(by=['col_a', 'col_b']) in older pandas versions); it's now a matter of getting to that new column and labeling each row.
There's cumcount, for precisely this case:
df['col_c'] = df.groupby('col_a').cumcount()
As it says in the docs:
Number each item in each group from 0 to the length of that group - 1.
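As a minimal sketch of how this gets to the desired output in the question, assuming a pandas version that provides sort_values: sort first so the numbering follows col_b within each group, then add 1 because cumcount starts at 0.

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Sort by group, then by col_b, so rows are numbered in col_b order within each group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount() is 0-based; add 1 to start the numbering at 1.
df['col_c'] = df.groupby('col_a').cumcount() + 1
print(df)
```

The sort only controls the order in which rows are numbered; cumcount itself numbers rows in whatever order they appear in the frame.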
Original answer (before cumcount was defined).
You could create a helper function to do this:
def add_col_c(x):
    x['col_c'] = np.arange(len(x))
    return x
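First sort by column col_a (df.sort has been removed from later pandas versions in favour of df.sort_values; the default sort is not guaranteed to be stable, so the order of rows within each group may differ from the output below):
In [11]: df.sort_values('col_a', inplace=True)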
then apply this function across each group:
In [12]: g = df.groupby('col_a', as_index=False)
In [13]: g.apply(add_col_c)
Out[13]:
col_a col_b col_c
3 A 3 0
8 A 8 1
0 A 0 2
4 A 4 3
6 B 6 0
1 B 1 1
7 B 7 2
9 C 9 0
2 C 2 1
5 C 5 2
In order to get 1, 2, ... you could use np.arange(1, len(x) + 1).
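How groupby.apply treats the grouping column has changed between pandas versions, so here is a hedged sketch of the same per-group idea written as an explicit loop instead. The names pieces and result are illustrative only and not part of the original answer.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df = df.sort_values(['col_a', 'col_b'])

pieces = []
for key, group in df.groupby('col_a'):
    group = group.copy()  # avoid modifying a slice of df
    # Number the rows of this group 1, 2, ..., len(group).
    group['col_c'] = np.arange(1, len(group) + 1)
    pieces.append(group)

# Stitch the numbered groups back together, keeping the original index labels.
result = pd.concat(pieces)
print(result)
```

This still runs Python code once per group, so it is slower than cumcount when there are many groups.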
The given answers both involve calling a Python function for each group, and if you have many groups a vectorized approach should be faster (I haven't checked).
Here is my pure NumPy suggestion:
In [5]: df.sort_values(['col_a', 'col_b'], inplace=True, ascending=[False, False])
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print(df)
col_a col_b col_c
9 C 9 0
5 C 5 1
2 C 2 2
7 B 7 0
6 B 6 1
1 B 1 2
8 A 8 0
4 A 4 1
3 A 3 2
0 A 0 3
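To make the arithmetic easier to follow, here is a self-contained sketch of the same idea for a current pandas version. Each row's position in the sorted frame has the starting position of its group subtracted from it; starts is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# The rows of each group must be contiguous for this trick to work.
df = df.sort_values(['col_a', 'col_b'], ascending=[False, False])

# Group sizes in order of appearance (sort=False keeps the C, B, A order).
sizes = df.groupby('col_a', sort=False).size().to_numpy()

# Starting position of each group, repeated once per row of that group.
starts = np.repeat(sizes.cumsum() - sizes, sizes)

# Position in the frame minus the group's start gives a 0-based rank within the group.
df['col_c'] = np.arange(len(df)) - starts
print(df)
```

Here sizes is [3, 3, 4], so the group starts are [0, 3, 6], and subtracting the repeated starts from 0..9 gives the per-group counters.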