
pandas dataframe delete rows with low frequency

What is the best practice for removing all rows whose value in a given column occurs with low frequency?

Dataframe:

IN:
foo bar poo
1   a   A
2   a   A
3   a   B
4   b   B
5   b   A
6   b   A
7   c   C
8   d   B
9   e   B
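
To reproduce this, here is a minimal sketch that rebuilds the frame above and prints how often each value occurs. It assumes the default integer index (0 to 8), which is what the output in the answers shows.

```python
import pandas as pd

# Rebuild the example frame from the question (default RangeIndex assumed).
df = pd.DataFrame({
    "foo": list(range(1, 10)),
    "bar": list("aaabbbcde"),
    "poo": list("AABBAACBB"),
})

# Frequency of each value per column; rows with rare values are the ones to drop.
print(df["poo"].value_counts())
print(df["bar"].value_counts())
```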

Example 1: Remove all rows whose value in column 'poo' occurs fewer than 3 times:

OUT:
foo bar poo
1   a   A
2   a   A
3   a   B
4   b   B
5   b   A
6   b   A
8   d   B
9   e   B

Example 2: Remove all rows whose value in column 'bar' occurs fewer than 3 times:

OUT:
foo bar poo
1   a   A
2   a   A
3   a   B
4   b   B
5   b   A
6   b   A
asked Mar 06 '18 17:03 by AnonX


2 Answers

This should generalise pretty easily. You'll need groupby + transform + count, and then filter the result:

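col = 'poo'  # or 'bar' for Example 2
n = 3        # minimum number of occurrences a value needs to be kept

# Broadcast each value's group size back to its rows, then keep rows with size >= n
df[df.groupby(col)[col].transform('count').ge(n)]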

   foo bar poo
0    1   a   A
1    2   a   A
2    3   a   B
3    4   b   B
4    5   b   A
5    6   b   A
7    8   d   B
8    9   e   B
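
As a self-contained sketch of the same groupby + transform approach, the snippet below wraps it in a small function and runs both examples from the question. The helper name `drop_rare` is hypothetical, chosen only for illustration.

```python
import pandas as pd

def drop_rare(df, col, n):
    """Keep rows whose value in `col` occurs at least `n` times (hypothetical helper)."""
    # transform('count') returns a Series aligned with df that holds each row's group size.
    counts = df.groupby(col)[col].transform("count")
    return df[counts.ge(n)]

# Example frame from the question.
df = pd.DataFrame({
    "foo": list(range(1, 10)),
    "bar": list("aaabbbcde"),
    "poo": list("AABBAACBB"),
})

by_poo = drop_rare(df, "poo", 3)  # Example 1: drops the single 'C' row
by_bar = drop_rare(df, "bar", 3)  # Example 2: drops the 'c', 'd' and 'e' rows

print(by_poo["foo"].tolist())  # [1, 2, 3, 4, 5, 6, 8, 9]
print(by_bar["foo"].tolist())  # [1, 2, 3, 4, 5, 6]
```

Because the mask is an ordinary boolean Series, it can be combined with other conditions using `&` or `|` before indexing.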
answered Oct 17 '22 05:10 by cs95


If I understand correctly, you can use groupby with filter:

df.groupby('poo').filter(lambda g: g['poo'].count() >= 3)
Out[81]: 
   foo bar poo
0    1   a   A
1    2   a   A
2    3   a   B
3    4   b   B
4    5   b   A
5    6   b   A
7    8   d   B
8    9   e   B

Or combine value_counts with isin:

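# Boolean Series indexed by value: True where the value occurs at least 3 times
s = df.poo.value_counts().ge(3)
# s[s].index holds the frequent values; keep only rows whose 'poo' is one of them
df.loc[df.poo.isin(s[s].index)]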
Out[89]: 
   foo bar poo
0    1   a   A
1    2   a   A
2    3   a   B
3    4   b   B
4    5   b   A
5    6   b   A
7    8   d   B
8    9   e   B
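
A closely related variant, offered as a sketch rather than as part of the answer above: instead of building the list of frequent values for isin, map each row's value to its count and compare directly. The snippet runs both forms on the question's data and checks that they agree.

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({
    "foo": list(range(1, 10)),
    "bar": list("aaabbbcde"),
    "poo": list("AABBAACBB"),
})

col, n = "poo", 3
counts = df[col].value_counts()  # value -> number of occurrences

# value_counts + isin: select the frequent values, then test membership.
via_isin = df[df[col].isin(counts[counts.ge(n)].index)]

# value_counts + map: look up each row's count, then compare with the threshold.
via_map = df[df[col].map(counts).ge(n)]

print(via_isin.equals(via_map))  # True
```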
answered Oct 17 '22 05:10 by BENY