Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: How to filter dataframe for duplicate items that occur at least n times in a dataframe

I have a Pandas DataFrame that contains duplicate entries; some items are listed twice or three times. I would like to filter it so that it only shows items that are listed at least n times:

  • the DataFrame contains 3 columns: ['colA', 'colB', 'colC']. It should only consider 'colB' in determining whether the item is listed multiple times.
  • Note: this is not drop_duplicates(). It's the opposite, I would like to drop items that are in the dataframe less than n times.
  • The end result should list each item only once.
like image 360
Nickpick Avatar asked Oct 03 '15 03:10

Nickpick


1 Answers

You can use value_counts to get the item count and then construct a boolean mask from this and reference the index and test membership using isin:

In [3]:
df = pd.DataFrame({'a':[0,0,0,1,2,2,3,3,3,3,3,3,4,4,4]})
df

Out[3]:
    a
0   0
1   0
2   0
3   1
4   2
5   2
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

In [8]:
df[df['a'].isin(df['a'].value_counts()[df['a'].value_counts()>2].index)]

Out[8]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

So breaking the above down:

In [9]:
df['a'].value_counts() > 2

Out[9]:
3     True
4     True
0     True
2    False
1    False
Name: a, dtype: bool

In [10]:
# construct a boolean mask
df['a'].value_counts()[df['a'].value_counts()>2]

Out[10]:
3    6
4    3
0    3
Name: a, dtype: int64

In [11]:
# we're interested in the index here, pass this to isin
df['a'].value_counts()[df['a'].value_counts()>2].index

Out[11]:
Int64Index([3, 4, 0], dtype='int64')

EDIT

As user @JonClements suggested a simpler and faster method would be to groupby on the col of interest and filter it:

In [4]:
df.groupby('a').filter(lambda x: len(x) > 2)

Out[4]:
    a
0   0
1   0
2   0
6   3
7   3
8   3
9   3
10  3
11  3
12  4
13  4
14  4

EDIT 2

To get just a single entry for each repeat call drop_duplicates and pass param subset='a':

In [2]:
df.groupby('a').filter(lambda x: len(x) > 2).drop_duplicates(subset='a')

Out[2]:
    a
0   0
6   3
12  4
like image 68
EdChum Avatar answered Oct 08 '22 12:10

EdChum