 

Faster alternative to perform pandas groupby operation

I have a dataset with name (person_name), day and color (shirt_color) as columns.

Each person wears a shirt with a certain color on a particular day. The number of days can be arbitrary.

E.g. input:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

I need to find the most frequently worn color for each person.

E.g. result:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

I am performing the following operation to get the results, which works fine but is quite slow:

most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?
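
For reference when trying out the answers below, the sample input can be rebuilt as a small DataFrame (named df here to match the answers; the code above uses data for the same frame):

import pandas as pd

# Sample data from the table above
df = pd.DataFrame({
    'name':  ['John', 'John', 'John', 'John', 'John',
              'Tom', 'Tom', 'Tom', 'Tom',
              'Jerry', 'Jerry', 'Jerry', 'Jerry'],
    'day':   [1, 2, 3, 4, 5, 2, 3, 4, 5, 1, 2, 4, 5],
    'color': ['White', 'White', 'Blue', 'Blue', 'White',
              'White', 'Blue', 'Blue', 'Black',
              'Black', 'Black', 'Black', 'White'],
})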

asked by astrobiologist


4 Answers

NumPy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.

import numpy as np
import pandas as pd

# Factorize names and colors into integer codes plus their unique labels
i, r = pd.factorize(df.name)   # i: integer codes, r: unique names
j, c = pd.factorize(df.color)  # j: integer codes, c: unique colors
n, m = len(r), len(c)

# Build an n x m contingency table of (name, color) counts
b = np.zeros((n, m), dtype=np.int64)
np.add.at(b, (i, j), 1)

# For each name, pick the color with the highest count
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object
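
The result above is a Series indexed by name. If a two-column DataFrame like the one asked for is preferred, it can be wrapped up as follows (a small follow-up sketch, not part of the original answer):

# Reuses r, c and b from the snippet above
result = pd.Series(c[b.argmax(1)], r, name='color')
result_df = result.rename_axis('name').reset_index()
#     name  color
# 0   John  White
# 1    Tom   Blue
# 2  Jerry  Black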

groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object

Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
answered by piRSquared


UPDATE

This should be hard to beat (roughly 10 times faster on the sample dataframe than any proposed pandas solution and about 1.5 times faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job with non-numerical data.

from itertools import groupby
from collections import Counter

# itertools.groupby requires the rows to be sorted by the grouping key (name)
pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
           for name, rows in groupby(sorted(df.values.tolist()),
                                     key=lambda row: row[0])})
# Jerry    Black
# John     White
# Tom       Blue
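
To sanity-check the timing claims, a rough benchmark along these lines can be used (the helper names itertools_way and pandas_way are illustrative, not from the answer; this assumes the sample df built in the question section, and absolute numbers will vary by machine and data shape):

import timeit
from itertools import groupby
from collections import Counter

def itertools_way(frame):
    # itertools.groupby requires the rows to be pre-sorted by the grouping key
    return pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
                      for name, rows in groupby(sorted(frame.values.tolist()),
                                                key=lambda row: row[0])})

def pandas_way(frame):
    return frame.groupby(['name', 'color']).size().unstack().idxmax(1)

print(timeit.timeit(lambda: itertools_way(df), number=1000))
print(timeit.timeit(lambda: pandas_way(df), number=1000))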

Old Answer

Here's another method. It is actually slower than the original one, but I'll keep it here:

data.groupby('name')['color']\
    .apply(pd.Series.value_counts)\
    .unstack().idxmax(axis=1)
# name
# Jerry    Black
# John     White
# Tom       Blue
answered by DYZ


Solution using pd.Series.mode

df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True)
Out[281]: 
name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
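
One caveat: pd.Series.mode returns every value tied for the highest count, so a name with two equally frequent colors would contribute two rows here. If a single color per name is required, taking the first mode (as the question's own code does) is one option; this variation is not part of the original answer:

# Arbitrary tie-break: keep only the first of the modal colors
df.groupby('name').color.agg(lambda s: s.mode()[0])
# Jerry    Black
# John     White
# Tom       Blue
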
answered by BENY


How about doing two groupings with transform(max)?

# Count days per (name, color) pair, then keep the rows where that count
# equals the per-name maximum
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)

Output:

    name  color
0   John  White
1    Tom   Blue
2  Jerry  Black
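
A note on ties: if a name has two colors sharing the maximum day count, the boolean mask keeps both rows. Dropping duplicates afterwards is one possible tie-break (an assumption about the desired behaviour, not part of the original answer):

# Keep a single (arbitrary) row per name in case of ties
df = df.drop_duplicates(subset='name').reset_index(drop=True)
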
answered by André C. Andersen