
Python pandas: exclude rows below a certain frequency count

So I have a pandas DataFrame that looks like this:

r vals    positions
1.2       1
1.8       2
2.3       1
1.8       1
2.1       3
2.0       3
1.9       1
...       ...

I would like to filter out all rows whose position does not appear at least 20 times. I have seen something like this:

g=df.groupby('positions')
g.filter(lambda x: len(x) > 20)

but this does not seem to work, and I do not understand how to get the original DataFrame back from it. Thanks in advance for the help.
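For reference, here is a minimal reproducible sketch of the setup (values copied from the sample above, with a threshold of 3 standing in for 20 since the sample has only 7 rows). The key point is that filter() already returns rows of the original frame; the result just needs to be assigned:

```python
import pandas as pd

# Reconstruction of the sample frame shown above.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# filter() returns the surviving rows of the original DataFrame
# (with their original index); it does not modify df in place.
kept = df.groupby("positions").filter(lambda g: len(g) >= 3)
```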

asked May 27 '15 by Wes Field


3 Answers

On your limited dataset the following works:

In [125]:
df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)

Out[125]:
0    1.2
2    2.3
3    1.8
6    1.9
Name: r vals, dtype: float64

You can assign the result of this filter and use this with isin to filter your orig df:

In [129]:
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
df[df['r vals'].isin(filtered)]

Out[129]:
   r vals  positions
0     1.2          1
1     1.8          2
2     2.3          1
3     1.8          1
6     1.9          1

You just need to change 3 to 20 for your case.
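One caveat with matching on r vals: isin compares values, so a row from a dropped group can sneak in when it shares an r vals value with a kept group. In the output above, index 1 (position 2, which occurs only once) survives because its 1.8 also occurs in the kept position 1 group. A sketch that avoids this by selecting on the filtered Series' index instead (same sample data assumed):

```python
import pandas as pd

# Same sample frame as above (values copied from the question).
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

filtered = df.groupby("positions")["r vals"].filter(lambda g: len(g) >= 3)
# filter() keeps the original row labels, so the index selects exactly
# the surviving rows, regardless of duplicate r vals values.
result = df.loc[filtered.index]
```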

Another approach is to use value_counts to create an aggregate Series, which can then be used to filter your df:

In [136]:
counts = df['positions'].value_counts()
counts

Out[136]:
1    4
3    2
2    1
dtype: int64

In [137]:
counts[counts > 3]

Out[137]:
1    4
dtype: int64

In [135]:
df[df['positions'].isin(counts[counts > 3].index)]

Out[135]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
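The two steps above can also be collapsed into a single boolean mask by mapping each row's position onto its frequency (a sketch on the same sample data; the threshold of 3 stands in for 20):

```python
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

min_count = 3  # assumed name; use 20 for the real data
# map() replaces each position with its total count, so the
# comparison yields a row-aligned boolean mask.
result = df[df["positions"].map(df["positions"].value_counts()) >= min_count]
```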

EDIT

If you want to filter the DataFrame rather than a single column, you can call filter on the groupby object directly:

In [139]:
filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
filtered

Out[139]:
   r vals  positions
0     1.2          1
2     2.3          1
3     1.8          1
6     1.9          1
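An equivalent vectorised alternative, usually faster than a Python lambda on large frames, compares each group's size via transform (same sample frame assumed):

```python
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# transform('size') broadcasts each group's row count back onto
# every row of that group, giving a boolean mask directly.
mask = df.groupby("positions")["positions"].transform("size") >= 3
result = df[mask]
```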
answered Oct 22 '22 by EdChum


I like the following method:

import pandas as pd


def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
    """Filters the DataFrame based on the value frequency in the specified column.

    :param df: DataFrame to be filtered.
    :param column: Column name that should be frequency filtered.
    :param min_freq: Minimal value frequency for the row to be accepted.
    :return: Frequency filtered DataFrame.
    """
    # Frequencies of each value in the column.
    freq = df[column].value_counts()
    # Select frequent values. Value is in the index.
    frequent_values = freq[freq >= min_freq].index
    # Return only rows with value frequency above threshold.
    return df[df[column].isin(frequent_values)]

It is much faster than the filter/lambda method in the accepted answer, since the Python-level overhead is minimised.
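A minimal usage sketch, with a condensed copy of the function repeated so the snippet runs on its own, and the question's sample data assumed:

```python
import pandas as pd


def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
    """Condensed copy of the function above, so this snippet is standalone."""
    freq = df[column].value_counts()
    return df[df[column].isin(freq[freq >= min_freq].index)]


df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

# Keep only rows whose position occurs at least 3 times.
out = filter_by_freq(df, "positions", min_freq=3)
```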

answered Oct 22 '22 by Piotr Dabkowski


How about selecting all rows where the positions value is >= 20:

mask = df['positions'] >= 20
sel = df.loc[mask, :]
answered Oct 22 '22 by Paul Joireman