I've got some data on sales, say, and want to look at how different postcodes compare: do some deliver more profitable business than others? So I'm grouping by postcode, and can easily get various stats out on a per-postcode basis. However, there are a few very high-value jobs which distort the stats, so I'd like to ignore the outliers. For various reasons, I'd like to define the outliers by group: so, for example, drop the rows in the dataframe that are in the top xth percentile of their group, or the top n in their group.
So if I've got the following data frame:
>>> df
Out[67]:
A C D
0 foo -0.536732 0.061055
1 bar 1.470956 1.350996
2 foo 1.981810 0.676978
3 bar -0.072829 0.417285
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
6 foo 0.959957 -1.068385
7 foo -0.640706 2.635910
I'd like to have some function, say drop_top_n(df, group_column, value_column, number_to_drop),
where drop_top_n(df, "A", "C", 2)
would return
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
Using filter drops whole groups, rather than parts of groups.
I could iterate through the groups, I suppose, and for each group find out which rows to drop, and then go back to the original dataframe and drop them, but that seems terribly clumsy. Is there a better way?
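(For the percentile variant mentioned at the top, a minimal sketch, assuming a recent pandas and a hypothetical helper name drop_top_percentile: compute each group's cutoff with transform("quantile") and keep the rows below it.)

import pandas as pd

def drop_top_percentile(df, group_column, value_column, q):
    # Per-group cutoff: the q-th quantile of value_column within each
    # group, broadcast back onto the rows by transform().
    cutoff = df.groupby(group_column)[value_column].transform("quantile", q)
    # Keep only rows strictly below their group's cutoff.
    return df[df[value_column] < cutoff]

# e.g. drop everything in the top 10% of its group:
# drop_top_percentile(df, "A", "C", 0.9)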
Since pandas 0.13 you can use cumcount:
In [11]: df[df.sort_values('C').groupby('A').cumcount(ascending=False) >= 2]  # add .sort_index() to the mask to silence the reindex UserWarning
Out[11]:
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
[4 rows x 3 columns]
It may make more sense to sort first:
In [21]: df = df.sort_values('C')
In [22]: df[df.groupby('A').cumcount(ascending=False) >= 2]
Out[22]:
A C D
4 foo -0.910537 -1.634047
7 foo -0.640706 2.635910
0 foo -0.536732 0.061055
5 bar -0.346749 -0.127740
[4 rows x 3 columns]
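For what it's worth, a minimal sketch wrapping this cumcount trick into the drop_top_n helper the question asks for (the function name comes from the question; the body assumes a modern pandas, where sort is spelled sort_values):

def drop_top_n(df, group_column, value_column, number_to_drop):
    # Rank rows within each group from the top: cumcount() after a
    # descending sort gives 0 to the group's largest value.
    rank_from_top = (df.sort_values(value_column, ascending=False)
                       .groupby(group_column)
                       .cumcount())
    # Keep rows ranked number_to_drop or lower; sort_index() realigns
    # the mask with df and silences the reindex UserWarning.
    return df[rank_from_top.sort_index() >= number_to_drop]

drop_top_n(df, "A", "C", 2) then reproduces the desired output shown in the question.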
You can use the apply() method:
import pandas as pd
import io
txt=""" A C D
0 foo -0.536732 0.061055
1 bar 1.470956 1.350996
2 foo 1.981810 0.676978
3 bar -0.072829 0.417285
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
6 foo 0.959957 -1.068385
7 foo -0.640706 2.635910"""
df = pd.read_csv(io.StringIO(txt), sep=r"\s+", index_col=0)
def f(df):
    return df.sort_values("C").iloc[:-2]
df2 = df.groupby("A", group_keys=False).apply(f)
print(df2)
output:
A C D
5 bar -0.346749 -0.127740
4 foo -0.910537 -1.634047
7 foo -0.640706 2.635910
0 foo -0.536732 0.061055
If you want original order:
print(df2.reindex(df.index[df.index.isin(df2.index)]))
output:
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
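(As an aside, not from the original answer: since the original index here is an ordered range, df2.sort_index() restores the original order just as well.)

print(df2.sort_index())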
To get rows above the group mean:
def f(df):
    return df[df.C > df.C.mean()]
df3 = df.groupby("A", group_keys=False).apply(f)
print(df3)
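As a design note: for a simple per-group threshold like this, apply() can be avoided entirely. A sketch of the same filter using transform, which broadcasts each group's mean back onto the rows and boolean-indexes in one vectorised step:

# Equivalent to f above, but without apply(): compare each row's C
# against its group's mean, computed via transform().
df3 = df[df["C"] > df.groupby("A")["C"].transform("mean")]
print(df3)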