Remove low frequency values from pandas.dataframe

Tags: python, pandas

How can I remove values that occur rarely (i.e. with a low frequency) from a column of a pandas.DataFrame? Example:

    In [4]: df[col_1].value_counts()
    Out[4]:
    0       189096
    1       110500
    2        77218
    3        61372
             ...
    2065         1
    2067         1
    1569         1
    dtype: int64

So my question is: how do I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() looks like this?

UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 only 1 (one) time, and I want to remove values like it.

Gilaztdinov Rustam asked Sep 10 '15 20:09

1 Answer

I see there are two ways you might want to do this.

For the entire DataFrame

This method replaces with NaN the values that occur infrequently across the entire DataFrame. We can do it without loops, using built-in functions to speed things up.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                      columns=['A', 'B'])

    # Values occurring this many times or fewer are replaced with NaN.
    threshold = 10
    value_counts = df.stack().value_counts()  # counts over the entire DataFrame

    to_remove = value_counts[value_counts <= threshold].index
    df.replace(to_remove, np.nan, inplace=True)
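To check the whole-DataFrame approach on data small enough to verify by hand, here is a minimal sketch. The helper name `mask_rare_values` and the sample data are made up for illustration, and it swaps `replace` for `isin` plus `mask`, which returns a new DataFrame instead of modifying the original. Note that it uses a strict `<`, so values occurring exactly `threshold` times are kept.

```python
import pandas as pd


def mask_rare_values(df, threshold):
    """Return a copy of df where values occurring fewer than `threshold`
    times across the whole frame are set to NaN (hypothetical helper)."""
    counts = df.stack().value_counts()           # counts over every cell
    rare = list(counts[counts < threshold].index)
    return df.mask(df.isin(rare))                # NaN wherever a rare value sits


# Hand-checkable data: 0 and 1 occur four times each, 2 and 3 once each.
df = pd.DataFrame({"A": [0, 0, 0, 1, 2], "B": [0, 1, 1, 1, 3]})
result = mask_rare_values(df, threshold=2)
print(result)
```

With `threshold=2`, only the last row (holding 2 and 3) should be masked. The columns become float because NaN cannot be stored in an integer column.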

Column-by-column

This method replaces with NaN the entries that occur infrequently within each column, counting every column separately.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                      columns=['A', 'B'])

    # Values occurring this many times or fewer are replaced with NaN.
    threshold = 10
    for col in df.columns:
        value_counts = df[col].value_counts()  # counts within this column only
        to_remove = value_counts[value_counts <= threshold].index
        # Assign the result back: an in-place replace on df[col] may act on a
        # temporary copy and leave df unchanged.
        df[col] = df[col].replace(to_remove, np.nan)
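The per-column idea can also be written without an explicit loop. The sketch below is one possible variant, not the method above: the helper name `mask_rare_per_column` and the sample data are invented for illustration. It maps each value to its count within its own column and keeps only values seen at least `threshold` times. If "remove" should mean dropping rows rather than leaving NaN, `dropna()` can follow.

```python
import pandas as pd


def mask_rare_per_column(df, threshold):
    """Return a copy of df where, column by column, values occurring fewer
    than `threshold` times are set to NaN (hypothetical helper)."""
    def mask_column(col):
        # Look up, for every cell, how often its value occurs in this column.
        counts = col.map(col.value_counts())
        return col.where(counts >= threshold)
    return df.apply(mask_column)


# Column A: 0 occurs three times, 1 and 2 once each.
# Column B: 1 occurs three times, 0 and 3 once each.
df = pd.DataFrame({"A": [0, 0, 0, 1, 2], "B": [0, 1, 1, 1, 3]})
masked = mask_rare_per_column(df, threshold=2)

# Optionally drop every row that lost a value in any column.
rows_kept = masked.dropna()
print(masked)
print(rows_kept)
```

Dropping rows is stricter than masking: a row goes away as soon as any one of its columns holds a rare value, so which of the two fits depends on how the data is used afterwards.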
thecircus answered Sep 19 '22 15:09