Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas one liner to filter rows by nunique count on a specific column

Tags:

python

pandas

In pandas, I regularly use the following to filter a dataframe by number of occurrences

df = df.groupby('A').filter(lambda x: len(x) >= THRESHOLD)

Assume df has another column 'B' and I want to filter the dataframe this time by the count of unique values on that column, I would expect something like

df = df.groupby('A').filter(lambda x: len(np.unique(x['B'])) >= THRESHOLD2)

But that doesn't seem to work, what would be the right approach?

like image 487
bluesummers Avatar asked Nov 21 '17 14:11

bluesummers


People also ask

How do I count a specific value in a column in pandas?

We can count by using the value_counts() method. This function is used to count the values present in the entire dataframe and also count values in a particular column.

How do I filter a specific column in a DataFrame?

We can select a column from a dataframe by using the column name we want to select as a list to filter() function. In this example, we select species column from the dataframe. By default, filter() function selects a column when we provide the column label as a list.

How do I count the number of occurrences in a column in Python?

Using the size() or count() method with pandas. DataFrame. groupby() will generate the count of a number of occurrences of data present in a particular column of the dataframe.

Does pandas Nunique count NaN?

Return number of unique elements in the object. Excludes NA values by default. Don't include NaN in the count.


1 Answers

It should working nice with nunique:

df = pd.DataFrame({'B':list('abccee'),
                   'E':[5,3,6,9,2,4],
                   'A':list('aabbcc')})

print (df)
   A  B  E
0  a  a  5
1  a  b  3
2  b  c  6
3  b  c  9
4  c  e  2
5  c  e  4

THRESHOLD2 = 2
df1 = df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
print (df1)
   A  B  E
0  a  a  5
1  a  b  3

But if need faster solution use transform and filter by boolean indexing:

df2 = df[df.groupby('A')['B'].transform('nunique') >= THRESHOLD2]
print (df2)
   A  B  E
0  a  a  5
1  a  b  3

Timings:

np.random.seed(123)
N = 1000000
L = list('abcde') 
df = pd.DataFrame({'B': np.random.choice(L, N, p=(0.75,0.0001,0.0005,0.0005,0.2489)),
                   'A':np.random.randint(10000,size=N)})
df = df.sort_values(['A','B']).reset_index(drop=True)
print (df)

THRESHOLD2 = 3

In [403]: %timeit df.groupby('A').filter(lambda x: x['B'].nunique() >= THRESHOLD2)
1 loop, best of 3: 3.05 s per loop

In [404]: %timeit df[df.groupby('A')['B'].transform('nunique')>= THRESHOLD2]
1 loop, best of 3: 558 ms per loop

Caveat

The results do not address performance given the number of groups, which will affect timings a lot for some of these solutions.

like image 50
jezrael Avatar answered Nov 14 '22 21:11

jezrael