Groupby value counts on the dataframe pandas

Question

I have the following dataframe:

df = pd.DataFrame([     (1, 1, 'term1'),     (1, 2, 'term2'),     (1, 1, 'term1'),     (1, 1, 'term2'),     (2, 2, 'term3'),     (2, 3, 'term1'),     (2, 2, 'term1') ], columns=['id', 'group', 'term'])

I want to group it by id and group and calculate the number of each term for this id, group pair.

So in the end I am going to get something like this:

enter image description here

I was able to achieve what I want by looping over all the rows with df.iterrows() and creating a new dataframe, but this is clearly inefficient. (If it helps, I know the list of all terms beforehand and there are ~10 of them).

It looks like I have to group by and then count values, so I tried that with df.groupby(['id', 'group']).value_counts() which does not work because value_counts operates on the groupby series and not a dataframe.

Anyway I can achieve this without looping?

piRSquared · Accepted Answer

I use groupby and size

df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0)

enter image description here

Timing

enter image description here

1,000,000 rows

df = pd.DataFrame(dict(id=np.random.choice(100, 1000000),                        group=np.random.choice(20, 1000000),                        term=np.random.choice(10, 1000000)))

enter image description here

MaxU - stop WAR against UA · Answer

using pivot_table() method:

In [22]: df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) Out[22]: term      term1  term2  term3 id group 1  1          2      1      0    2          0      1      0 2  2          1      0      1    3          1      0      0

Timing against 700K rows DF:

In [24]: df = pd.concat([df] * 10**5, ignore_index=True)  In [25]: df.shape Out[25]: (700000, 3)  In [3]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) 1 loop, best of 3: 226 ms per loop  In [4]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) 1 loop, best of 3: 236 ms per loop  In [5]: %timeit pd.crosstab([df.id, df.group], df.term) 1 loop, best of 3: 355 ms per loop  In [6]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int) 1 loop, best of 3: 232 ms per loop  In [7]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0) 1 loop, best of 3: 231 ms per loop

Timing against 7M rows DF:

In [9]: df = pd.concat([df] * 10, ignore_index=True)  In [10]: df.shape Out[10]: (7000000, 3)  In [11]: %timeit df.groupby(['id', 'group', 'term'])['term'].size().unstack(fill_value=0) 1 loop, best of 3: 2.27 s per loop  In [12]: %timeit df.pivot_table(index=['id','group'], columns='term', aggfunc='size', fill_value=0) 1 loop, best of 3: 2.3 s per loop  In [13]: %timeit pd.crosstab([df.id, df.group], df.term) 1 loop, best of 3: 3.37 s per loop  In [14]: %timeit df.groupby(['id','group','term'])['term'].size().unstack().fillna(0).astype(int) 1 loop, best of 3: 2.28 s per loop  In [15]: %timeit df.groupby(['id', 'group', 'term']).size().unstack(fill_value=0) 1 loop, best of 3: 1.89 s per loop

Groupby value counts on the dataframe pandas

Tags:

python

pandas

dataframe

pandas-groupby

crosstab

Salvador Dali

2 Answers

Timing

piRSquared

MaxU - stop WAR against UA

Recent Activity

Donate For Us

Groupby value counts on the dataframe pandas

Tags:

python

pandas

dataframe

pandas-groupby

crosstab

Salvador Dali

2 Answers

Timing

piRSquared

MaxU - stop WAR against UA

Related questions

Recent Activity

Donate For Us