Pandas one liner to filter rows by nunique count on a specific column

Tags: python, pandas

In pandas, I regularly use the following to filter a dataframe by the number of occurrences of each value in column 'A':

df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)

Now assume df has another column 'B', and I want to filter the dataframe by the number of unique values in that column within each group. I would expect something like:

df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)

But that doesn't seem to work. What would be the right approach?

asked Nov 21 '17 14:11

bluesummers

1 Answer

It should work nicely with nunique:

import pandas as pd

# 'A' is listed first so the column order matches the printed output below
df = pd.DataFrame({'A':list('aabbcc'),
                   'B':list('abccee'),
                   'E':[5,3,6,9,2,4]})

print (df)
   A  B  E
0  a  a  5
1  a  b  3
2  b  c  6
3  b  c  9
4  c  e  2
5  c  e  4

THRESHOLD2 = 2
df1 = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
print (df1)
   A  B  E
0  a  a  5
1  a  b  3

If you need a faster solution, use transform to compute the per-group nunique for every row, then filter with boolean indexing:

df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
   A  B  E
0  a  a  5
1  a  b  3
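
As a quick sanity check, the two approaches above can be compared directly. This is a minimal sketch on the same sample data; the names via_filter, mask and via_mask are introduced here only for illustration and are not part of the original answer.

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({'A': list('aabbcc'),
                   'B': list('abccee'),
                   'E': [5, 3, 6, 9, 2, 4]})
THRESHOLD2 = 2

# Approach 1: call a Python function once per group
via_filter = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)

# Approach 2: broadcast the per-group nunique back to every row, then mask
mask = df.groupby('A')['B'].transform('nunique') >= THRESHOLD2
via_mask = df[mask]

# Both keep the original row order and index, so they can be compared as-is
same = via_filter.equals(via_mask)
print(same)
```

Note that nunique ignores missing values by default (dropna=True), so NaN in 'B' does not count as a distinct value in either approach.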

Timings:

import numpy as np

np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print (df)

THRESHOLD2 = 3

In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop

In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop

Caveat

These timings come from a single configuration (1,000,000 rows and up to 10,000 groups in 'A'). They do not show how performance changes with the number of groups, which strongly affects the timings of some of these solutions.
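
To see how the number of groups affects each approach on your own machine, something like the following could be used. It is only a rough sketch: time_both is a hypothetical helper, the data is uniform random rather than the skewed distribution used above, and the sizes are kept small so it runs quickly. No timings are quoted here because they depend on hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_both(n_rows, n_groups, threshold=3, repeats=3):
    """Hypothetical helper: average seconds per call for (filter, transform)."""
    rng = np.random.default_rng(123)
    df = pd.DataFrame({'A': rng.integers(n_groups, size=n_rows),
                       'B': rng.choice(list('abcde'), size=n_rows)})

    def with_filter():
        return df.groupby('A').filter(lambda x: x['B'].nunique() >= threshold)

    def with_transform():
        return df[df.groupby('A')['B'].transform('nunique') >= threshold]

    t_filter = timeit.timeit(with_filter, number=repeats) / repeats
    t_transform = timeit.timeit(with_transform, number=repeats) / repeats
    return t_filter, t_transform


# Same number of rows, few groups versus many groups
timings = {g: time_both(20_000, g) for g in (10, 1_000)}
for g, (t_filter, t_transform) in timings.items():
    print(f"{g:>5} groups: filter {t_filter:.4f}s, transform {t_transform:.4f}s")
```

Increasing n_rows and the tuple of group counts gives a fuller picture, at the cost of a longer run.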

answered Nov 14 '22 21:11

jezrael
