What is the best practice for removing all rows whose value in a given column occurs with low frequency?
Dataframe:
IN:
foo bar poo
1 a A
2 a A
3 a B
4 b B
5 b A
6 b A
7 c C
8 d B
9 e B
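To make the examples reproducible, here is a minimal sketch that builds the frame above. It assumes `foo` holds integers and `bar` and `poo` hold strings, and the name `df` is chosen to match the answers further down. The two `value_counts` calls show the frequencies that the examples are based on.

```python
import pandas as pd

# Sample data from the question (dtypes are an assumption: ints and strings)
df = pd.DataFrame({
    "foo": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "bar": ["a", "a", "a", "b", "b", "b", "c", "d", "e"],
    "poo": ["A", "A", "B", "B", "A", "A", "C", "B", "B"],
})

# How often each value occurs in each column
poo_counts = df["poo"].value_counts()  # A and B occur 4 times each, C once
bar_counts = df["bar"].value_counts()  # a and b occur 3 times each, c, d, e once
```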
Example 1: Remove all rows whose value in column 'poo' occurs fewer than 3 times:
OUT:
foo bar poo
1 a A
2 a A
3 a B
4 b B
5 b A
6 b A
8 d B
9 e B
Example 2: Remove all rows whose value in column 'bar' occurs fewer than 3 times:
OUT:
foo bar poo
1 a A
2 a A
3 a B
4 b B
5 b A
6 b A
Note that the axis argument must be 0 to delete rows (in pandas drop(), axis defaults to 0, so it can be omitted). If axis=1 is specified, columns are deleted instead. Alternatively, a more intuitive way to delete a row from a DataFrame is to use the index argument.
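A small sketch of the three variants described here, using a toy frame whose column names are borrowed from the question (the data itself is only an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": ["a", "a", "b"]})

# axis=0 is the default, so this drops the row labelled 0
rows_by_label = df.drop(0)

# Same idea with the index keyword, here dropping two rows at once
rows_by_index = df.drop(index=[0, 1])

# axis=1 drops a column instead of a row
cols_dropped = df.drop("bar", axis=1)
```

None of these calls modify `df` itself; each returns a new DataFrame.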
Now we have to drop rows based on a condition. Specify the column name together with a condition and pass the matching index to drop: dataframe.drop(dataframe[dataframe['column'] operator value].index)
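As a rough illustration of that pattern, the sketch below drops every row where `foo` is greater than 2. The frame and the condition are assumptions chosen for the example. The boolean-mask form on the last line is usually the simpler way to get the same result.

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3, 4], "bar": ["a", "a", "b", "b"]})

# Select the rows to remove, then pass their index labels to drop
dropped = df.drop(df[df["foo"] > 2].index)

# Equivalent: keep the rows that satisfy the opposite condition
kept = df[df["foo"] <= 2]
```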
A DataFrame is similar to a table that stores data in rows and columns. Rows represent the records (tuples) and columns represent the attributes. We can create a DataFrame with the pandas.DataFrame() constructor. We can also build one from a dictionary, in which case the columns and index arguments can be omitted.
This should generalise pretty easily. You'll need groupby + transform + count, and then filter the result:
col = 'poo' # 'bar'
n = 3 # 2
df[df.groupby(col)[col].transform('count').ge(n)]
foo bar poo
0 1 a A
1 2 a A
2 3 a B
3 4 b B
4 5 b A
5 6 b A
7 8 d B
8 9 e B
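To show how this generalises, here is a sketch that wraps the same groupby + transform + count idea in a function and runs it on both examples from the question. The helper name `keep_frequent` is hypothetical, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "bar": ["a", "a", "a", "b", "b", "b", "c", "d", "e"],
    "poo": ["A", "A", "B", "B", "A", "A", "C", "B", "B"],
})

def keep_frequent(frame, col, n):
    """Keep rows whose value in `col` occurs at least `n` times (hypothetical helper)."""
    # transform('count') gives every row the size of the group it belongs to
    counts = frame.groupby(col)[col].transform("count")
    return frame[counts.ge(n)]

example_1 = keep_frequent(df, "poo", 3)  # drops only the row with poo == 'C'
example_2 = keep_frequent(df, "bar", 3)  # drops the rows with bar in {'c', 'd', 'e'}
```

Because `transform` returns a Series aligned with the original index, the mask can be applied directly and the original row labels are preserved.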
IIUC, you can use groupby with filter, which keeps only the groups for which the function returns True:
df.groupby('poo').filter(lambda x: x['poo'].count() >= 3)
Out[81]:
foo bar poo
0 1 a A
1 2 a A
2 3 a B
3 4 b B
4 5 b A
5 6 b A
7 8 d B
8 9 e B
Or combine value_counts with isin:
s = df.poo.value_counts().ge(3)  # ge, not gt: values that occur exactly 3 times must be kept
df.loc[df.poo.isin(s[s].index)]
Out[89]:
foo bar poo
0 1 a A
1 2 a A
2 3 a B
3 4 b B
4 5 b A
5 6 b A
7 8 d B
8 9 e B
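A note on the threshold: for column 'poo' every surviving value occurs 4 times, so "at least 3" and "more than 3" happen to give the same output. Column 'bar' is where the difference shows, since 'a' and 'b' occur exactly 3 times. The sketch below rebuilds the question's frame and applies the value_counts + isin approach with both comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "bar": ["a", "a", "a", "b", "b", "b", "c", "d", "e"],
    "poo": ["A", "A", "B", "B", "A", "A", "C", "B", "B"],
})

counts = df["bar"].value_counts()

# "at least 3 occurrences": keeps 'a' and 'b', matching Example 2
at_least_3 = df[df["bar"].isin(counts[counts.ge(3)].index)]

# "more than 3 occurrences": no value of 'bar' qualifies, so nothing is kept
more_than_3 = df[df["bar"].isin(counts[counts.gt(3)].index)]
```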