How do I filter a pandas DataFrame based on value counts?

I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game whose genre appears fewer than some number of times in the DataFrame, but I have no clue how to go about this. I did find a Stack Overflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).

Help?

uchuujin asked Apr 24 '15 00:04



2 Answers

Use groupby filter:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4

I recommend reading the split-apply-combine section of the docs.
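Applied to the scenario in the question, the same groupby filter might look like the sketch below. The column names `title` and `genre`, the sample rows, and the `min_count` threshold are all assumptions made for illustration; the asker's actual DataFrame is not shown.

```python
import pandas as pd

# Hypothetical sample data; the real column names are not given in the question.
games = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "genre": ["RPG", "RPG", "RPG", "Puzzle", "Racing", "Racing"],
})

min_count = 2  # assumed threshold: keep genres that appear at least this often

# filter() calls the lambda once per genre group and keeps the rows of
# every group for which it returns True.
filtered = games.groupby("genre").filter(lambda g: len(g) >= min_count)
print(filtered)
```

Here the single "Puzzle" row is dropped because its group has only one member.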

Andy Hayden answered Sep 25 '22 21:09


For better performance, use GroupBy.transform with 'size' to get the count per group as a Series with the same length and index as the original df, which makes it possible to filter by boolean indexing:

df1 = df[df.groupby("A")['A'].transform('size') > 1] 
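To see why this works, it can help to look at the intermediate Series. A minimal sketch that reuses the toy frame from the first answer; the names `sizes` and `mask` are introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=["A", "B"])

# Size of each row's group, aligned to df's index:
# rows 0 and 1 belong to the group A == 1 (size 2), row 2 to A == 5 (size 1).
sizes = df.groupby("A")["A"].transform("size")

# Boolean mask that is True for rows whose group has more than one member.
mask = sizes > 1

df1 = df[mask]
print(df1)
```

Because the mask is computed in one vectorized step, no Python function is called per group.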

Or use Series.map with Series.value_counts:

df1 = df[df['A'].map(df['A'].value_counts()) > 1] 
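For the video game case in the question, the value_counts variant could be sketched as follows. The `title` and `genre` column names, the sample rows, and the `min_count` threshold are assumptions, not taken from the asker's data.

```python
import pandas as pd

# Hypothetical sample data; the real column names are not given in the question.
games = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "genre": ["RPG", "RPG", "RPG", "Puzzle", "Racing", "Racing"],
})

min_count = 2  # assumed threshold: keep genres that appear at least this often

# Series mapping each genre to the number of rows that have it.
counts = games["genre"].value_counts()

# Look up every row's genre in counts, giving a per-row count aligned to games.
per_row = games["genre"].map(counts)

# Keep only the rows whose genre is frequent enough.
kept = games[per_row >= min_count]
print(kept)
```

The comparison operator sets the cutoff: `>=` keeps genres that appear at least `min_count` times, while `>` would require strictly more.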

jezrael answered Sep 24 '22 21:09