How can I remove values from a column in a pandas.DataFrame that occur rarely, i.e. with a low frequency? Example:
    In [4]: df[col_1].value_counts()
    Out[4]:
    0       189096
    1       110500
    2        77218
    3        61372
    ...
    2065         1
    2067         1
    1569         1
    dtype: int64
So, my question is: how do I remove values like 2065, 2067, 1569 and others? And how can I do this for ALL columns whose .value_counts() output looks like this?
UPDATE: By 'low' I mean values like 2065. This value occurs in col_1 only 1 (one) time, and I want to remove values like this.
For reference, Series.value_counts() returns a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring one.
I see there are two ways you might want to do this.
For the entire DataFrame
This method replaces with NaN the values that occur infrequently across the entire DataFrame. It needs no loops and relies on built-in functions, which keeps it fast.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

    threshold = 10  # Values that occur this many times or fewer will be replaced.
    value_counts = df.stack().value_counts()  # Counts over the entire DataFrame
    to_remove = value_counts[value_counts <= threshold].index
    df.replace(to_remove, np.nan, inplace=True)  # Rare values become NaN
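With random data like this, every value is usually common enough that nothing gets replaced, so here is a small deterministic sketch of the same idea. The frame `small` and the threshold of 1 are made up for illustration, and it swaps in `isin` plus `where` instead of `replace`. The final `dropna()` is one possible way to actually drop the affected rows rather than leave NaN behind.

```python
import pandas as pd

# Hypothetical toy data: 7 and 9 each occur once in the whole frame.
small = pd.DataFrame({"A": [0, 0, 1, 1, 7], "B": [1, 0, 0, 9, 1]})
threshold = 1  # values occurring this many times or fewer count as rare

counts = small.stack().value_counts()               # counts over every cell
rare = counts[counts <= threshold].index.tolist()   # the rare values themselves

masked = small.where(~small.isin(rare))  # rare cells become NaN
cleaned = masked.dropna()                # optional: drop rows that held a rare value
```

Whether to keep the NaN placeholders or drop whole rows depends on whether the other columns in those rows are still useful to you.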
Column-by-column
This method replaces with NaN the entries that occur infrequently within each column, counting each column separately.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)), columns=['A', 'B'])

    threshold = 10  # Values that occur this many times or fewer will be replaced.
    for col in df.columns:
        value_counts = df[col].value_counts()  # Counts for this column only
        to_remove = value_counts[value_counts <= threshold].index
        # Assign the result back: an inplace replace on df[col] may act on a
        # temporary copy in recent pandas versions and leave df unchanged.
        df[col] = df[col].replace(to_remove, np.nan)
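If you would rather avoid the explicit loop, the per-column version can also be sketched with `map` and `where`. The helper name `mask_rare` and the toy frame are assumptions made for this example, not part of pandas.

```python
import pandas as pd

def mask_rare(column: pd.Series, threshold: int) -> pd.Series:
    """Hypothetical helper: NaN out values occurring `threshold` times or fewer."""
    # Look up, for every row, how often that row's value occurs in the column.
    counts = column.map(column.value_counts())
    # Keep values whose count exceeds the threshold; the rest become NaN.
    return column.where(counts > threshold)

# Toy data: 5 occurs once in A, 8 occurs once in B.
toy = pd.DataFrame({"A": [0, 0, 0, 5, 1, 1], "B": [2, 2, 3, 3, 3, 8]})
result = toy.apply(mask_rare, threshold=1)  # applied to each column separately
```

Because `apply` passes one column at a time, each column gets its own frequency table, which matches the loop above.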