So I have a pandas DataFrame that looks like this:
r vals    positions
1.2       1
1.8       2
2.3       1
1.8       1
2.1       3
2.0       3
1.9       1
...       ...
I would like to filter out all rows whose positions value does not appear at least 20 times. I have seen something like this:
g=df.groupby('positions')
g.filter(lambda x: len(x) > 20)
but this does not seem to work, and I do not understand how to get the original DataFrame back from it. Thanks in advance for the help.
On your limited dataset the following works:
In [125]:
df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
Out[125]:
0 1.2
2 2.3
3 1.8
6 1.9
Name: r vals, dtype: float64
You can assign the result of this filter and use it with isin to filter your original df:
In [129]:
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
df[df['r vals'].isin(filtered)]
Out[129]:
r vals positions
0 1.2 1
1 1.8 2
2 2.3 1
3 1.8 1
6 1.9 1
You just need to change 3 to 20 in your case. One caveat: matching on the 'r vals' column means rows from other positions can sneak in when a value repeats across groups (index 1 above is kept only because 1.8 also occurs in position 1), so the groupby filter shown in the EDIT below is more robust.
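If you want to stay with the Series route, a safer variant (not from the original answer) selects the surviving rows by index instead of by value, which duplicate 'r vals' values cannot fool. A minimal sketch with made-up data and the 20-row threshold:

import numpy as np
import pandas as pd

# Made-up example data: 1000 rows, positions drawn from 1-10.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'r vals': rng.normal(2.0, 0.5, size=1000),
    'positions': rng.integers(1, 11, size=1000),
})

# Filter the Series, then select rows by the surviving index labels.
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 20)
result = df.loc[filtered.index]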
Another approach would be to use value_counts to create an aggregate Series; we can then use this to filter your df:
In [136]:
counts = df['positions'].value_counts()
counts
Out[136]:
1 4
3 2
2 1
dtype: int64
In [137]:
counts[counts >= 3]
Out[137]:
1 4
dtype: int64
In [135]:
df[df['positions'].isin(counts[counts >= 3].index)]
Out[135]:
r vals positions
0 1.2 1
2 2.3 1
3 1.8 1
6 1.9 1
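Pulled out of the interactive session, the same value_counts idea reads as follows (a sketch assuming the df from the question and the asker's threshold of 20):

# Count occurrences per position, keep the frequent values,
# then select matching rows with a boolean mask.
counts = df['positions'].value_counts()
frequent = counts[counts >= 20].index
result = df[df['positions'].isin(frequent)]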
EDIT
If you want to filter the DataFrame rather than a Series, you can call filter on the groupby object directly:
In [139]:
filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
filtered
Out[139]:
r vals positions
0 1.2 1
2 2.3 1
3 1.8 1
6 1.9 1
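As a side note (not part of the original answer), the same result can be built with a boolean mask from groupby(...).transform('size'), which often beats filter on large frames because it avoids calling a Python lambda per group:

# Tag each row with the size of its 'positions' group,
# then compare against the threshold.
result = df[df.groupby('positions')['positions'].transform('size') >= 20]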
I like the following method:
import pandas as pd

def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
"""Filters the DataFrame based on the value frequency in the specified column.
:param df: DataFrame to be filtered.
:param column: Column name that should be frequency filtered.
:param min_freq: Minimal value frequency for the row to be accepted.
:return: Frequency filtered DataFrame.
"""
# Frequencies of each value in the column.
freq = df[column].value_counts()
# Select frequent values. Value is in the index.
frequent_values = freq[freq >= min_freq].index
# Return only rows with value frequency above threshold.
return df[df[column].isin(frequent_values)]
It is much faster than the filter lambda method in the accepted answer, because the Python-level overhead is minimised.
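A quick usage example on the question's data (the 20-row threshold is the asker's; actual speedups will vary with the data):

# Keep only rows whose 'positions' value occurs at least 20 times.
result = filter_by_freq(df, column='positions', min_freq=20)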
How about selecting all rows whose positions value is >= 20:
mask = df['positions'] >= 20
sel = df.loc[mask, :]