I'm trying to remove entries from a data frame that occur fewer than 100 times. The data frame, data,
looks like this:
    pid  tag
    1    23
    1    45
    1    62
    2    24
    2    45
    3    34
    3    25
    3    62
Now I count the number of tag occurrences like this:
bytag = data.groupby('tag').aggregate(np.count_nonzero)
But then I can't figure out how to remove those entries which have low count...
Use the pandas.DataFrame.drop() method to remove rows that match one or more conditions.
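As a rough sketch of how drop() could be applied to this question: build the per-row tag count, collect the index labels of the rows to discard, and pass those labels to drop(). The names min_count, tag_counts, and rare_rows are illustrative and not from the original post, and the threshold is lowered from 100 to 2 so the small sample data shows an effect.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

min_count = 2  # illustrative threshold; the question itself uses 100

# For every row, look up how often that row's tag occurs in the column.
tag_counts = data['tag'].map(data['tag'].value_counts())

# drop() takes index labels, so collect the labels of the rare rows first.
rare_rows = data.index[tag_counts < min_count]
kept = data.drop(rare_rows)

print(kept)
```

Note that drop() works on labels, not on a boolean mask, which is why the condition is first converted to index labels.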
How do you count the number of occurrences in a data frame? To count how often each value occurs in a column, use the pandas value_counts() method. For example, df['condition'].value_counts() returns the frequency of each unique value in the column "condition".
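One way value_counts() might be combined with isin() to answer the question is sketched below. The names min_count and frequent_tags are assumptions made for illustration, and the threshold is 2 rather than 100 so the sample data is not filtered down to nothing.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

min_count = 2  # illustrative threshold; the question itself uses 100

# Series indexed by tag value, holding the number of rows with that tag.
counts = data['tag'].value_counts()

# Keep only the tag values that occur often enough.
frequent_tags = counts[counts >= min_count].index

# Select the rows whose tag is in the set of frequent tags.
filtered = data[data['tag'].isin(frequent_tags)]

print(filtered)
```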
The values property returns a NumPy representation of the DataFrame. Only the values are returned; the axis labels are removed. A DataFrame where all columns have the same type (e.g., int64) results in an array of that type.
Elements can be removed from a pandas Series through the methods drop() and truncate(). The drop() method removes the elements with the given index labels. The truncate() method removes the elements before and after the given index values.
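A small sketch of both methods on a toy Series follows. The Series s and its labels are made up for illustration and are unrelated to the question's data. truncate() expects a sorted index, which holds here.

```python
import pandas as pd

# Hypothetical Series with a sorted string index.
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# drop() removes the elements with the listed index labels.
dropped = s.drop(['b', 'd'])

# truncate() removes everything before 'b' and after 'd' (both ends are kept).
truncated = s.truncate(before='b', after='d')

print(dropped)
print(truncated)
```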
New in pandas 0.12: groupby objects have a filter method, which allows you to do these types of operations:
    In [11]: g = data.groupby('tag')

    In [12]: g.filter(lambda x: len(x) > 1)  # pandas 0.13.1
    Out[12]:
       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62
The function (the first argument of filter) is applied to each group (subframe), and the results include elements of the original DataFrame belonging to groups which evaluated to True.
Note: in 0.12 the row ordering differs from that of the original DataFrame; this was fixed in 0.13+:
    In [21]: g.filter(lambda x: len(x) > 1)  # pandas 0.12
    Out[21]:
       pid  tag
    1    1   45
    4    2   45
    2    1   62
    7    3   62
Edit: Thanks to @WesMcKinney for showing this much more direct way:
data[data.groupby('tag').pid.transform(len) > 1]
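The transform approach works because transform returns a Series aligned with the original rows, so the comparison yields a boolean mask that can index the frame directly. A minimal runnable sketch follows, with the assumption that the built-in 'size' aggregation is used in place of len. The names min_count and counts_per_row are illustrative, and the threshold is 2 rather than 100 to suit the sample data.

```python
import pandas as pd

data = pd.DataFrame({
    'pid': [1, 1, 1, 2, 2, 3, 3, 3],
    'tag': [23, 45, 62, 24, 45, 34, 25, 62],
})

min_count = 2  # illustrative threshold; the question itself uses 100

# One value per original row: the size of the group that row belongs to.
counts_per_row = data.groupby('tag')['pid'].transform('size')

# The comparison gives a boolean mask with the same index as data.
filtered = data[counts_per_row >= min_count]

print(filtered)
```

Because the mask shares the frame's index, the original row order is preserved.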
    import pandas
    import numpy as np

    data = pandas.DataFrame(
        {'pid': [1, 1, 1, 2, 2, 3, 3, 3],
         'tag': [23, 45, 62, 24, 45, 34, 25, 62],
         })

    bytag = data.groupby('tag').aggregate(np.count_nonzero)
    tags = bytag[bytag.pid >= 2].index
    print(data[data['tag'].isin(tags)])
yields
       pid  tag
    1    1   45
    2    1   62
    4    2   45
    7    3   62