Python Pandas: remove entries based on the number of occurrences


I'm trying to remove entries from a data frame that occur fewer than 100 times. The data frame, data, looks like this:

    pid   tag
    1     23
    1     45
    1     62
    2     24
    2     45
    3     34
    3     25
    3     62

Now I count the number of tag occurrences like this:

bytag = data.groupby('tag').aggregate(np.count_nonzero) 

But then I can't figure out how to remove the entries whose tags have a low count...

sashkello asked Nov 19 '12 01:11



2 Answers

New in pandas 0.12: groupby objects have a filter method, which allows this type of operation:

    In [11]: g = data.groupby('tag')

    In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
    Out[12]:
       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62

The function (the first argument of filter) is applied to each group (subframe), and the results include elements of the original DataFrame belonging to groups which evaluated to True.

Note: in pandas 0.12 the rows come back in a different order from the original DataFrame. This was fixed in 0.13+:

    In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
    Out[21]:
       pid  tag
    1    1   45
    4    2   45
    2    1   62
    7    3   62
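As a rough sketch of how the filter approach above could be applied with an arbitrary threshold: the helper name drop_rare_tags and its min_count parameter are made up for illustration and are not part of pandas. The sample data is the small frame from the question, so the threshold here is 2; for the original problem you would pass min_count=100.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})


def drop_rare_tags(df, min_count):
    """Hypothetical helper: keep rows whose tag occurs at least min_count times."""
    # filter() calls the lambda once per tag group and keeps the rows of
    # every group for which it returns True.
    return df.groupby("tag").filter(lambda group: len(group) >= min_count)


kept = drop_rare_tags(data, min_count=2)
print(kept)
```

Note the comparison is >= min_count, so "remove tags that occur fewer than 100 times" maps directly to min_count=100.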
Andy Hayden answered Sep 17 '22 18:09


Edit: Thanks to @WesMcKinney for showing this much more direct way:

data[data.groupby('tag').pid.transform(len) > 1] 
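To see why the one-liner above works, it may help to look at the intermediate result. This is only an illustrative sketch using the sample data from the question; the variable names counts and mask are my own. transform(len) returns one value per original row (the size of that row's tag group), so the comparison yields a boolean mask aligned with data.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})

# One entry per row of `data`: how many rows share that row's tag.
counts = data.groupby("tag")["pid"].transform(len)
print(counts.tolist())

# Boolean mask with the same index as `data`; True where the tag repeats.
mask = counts > 1
print(data[mask])
```

Because the mask has the same index as data, the original row order is preserved.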

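    import pandas
    import numpy as np

    data = pandas.DataFrame(
        {'pid' : [1,1,1,2,2,3,3,3],
         'tag' : [23,45,62,24,45,34,25,62],
         })

    # Count rows per tag. np.count_nonzero counts non-zero values,
    # so this assumes pid is never 0.
    bytag = data.groupby('tag').aggregate(np.count_nonzero)
    # Tags that occur at least twice.
    tags = bytag[bytag.pid >= 2].index
    # Keep only the rows whose tag is in that set.
    print(data[data['tag'].isin(tags)])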

yields

       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62
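A possible variation on the same count-then-isin idea, sketched under a couple of assumptions: it swaps np.count_nonzero for groupby(...).size(), which counts rows per group regardless of the values in pid, and it introduces a min_count variable (my own name) for the threshold. For the original question, min_count would be 100.

```python
import pandas as pd

# Sample data from the question.
data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})

min_count = 2  # illustrative threshold; the question asks for 100

# Series indexed by tag, holding the number of rows for each tag.
tag_counts = data.groupby("tag").size()

# Index of tags that meet the threshold.
frequent_tags = tag_counts[tag_counts >= min_count].index

# Keep only rows whose tag is one of the frequent tags.
result = data[data["tag"].isin(frequent_tags)]
print(result)
```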
unutbu answered Sep 21 '22 18:09