
Drop some Pandas dataframe rows using group based condition

Tags:

python

pandas

I've got some data on sales, say, and want to look at how different post codes compare: do some deliver more profitable business than others? So I'm grouping by postcode, and can easily get various stats out on a per-postcode basis. However, there are a few very high-value jobs which distort the stats, so I'd like to ignore the outliers. For various reasons, I'd like to define the outliers by group: for example, drop the rows in the dataframe that are in the top xth percentile of their group, or the top n in their group.

So if I've got the following data frame:

>>> df
Out[67]: 
     A         C         D
0  foo -0.536732  0.061055
1  bar  1.470956  1.350996
2  foo  1.981810  0.676978
3  bar -0.072829  0.417285
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
6  foo  0.959957 -1.068385
7  foo -0.640706  2.635910

I'd like to be able to have some function, say drop_top_n(df, group_column, value_column, number_to_drop) where drop_top_n(df, "A", "C", 2) would return

     A         C         D
0  foo -0.536732  0.061055
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
7  foo -0.640706  2.635910

Using filter drops whole groups, rather than parts of groups.

I could iterate through the groups, I suppose, and for each group find out which rows to drop, and then go back to the original dataframe and drop them, but that seems terribly clumsy. Is there a better way?
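For reference, the loop-based version I have in mind is something like this (a rough sketch; it collects each group's top-2 index labels with `Series.nlargest`, then drops them all at once):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829, -0.910537,
          -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285, -1.634047,
          -0.127740, -1.068385, 2.635910],
})

# The "clumsy" approach: gather index labels of each group's top 2 values
# of C, then drop them from the original frame in a single call.
to_drop = []
for _, group in df.groupby("A"):
    to_drop.extend(group["C"].nlargest(2).index)
result = df.drop(to_drop)
print(result)
```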

asked Jan 19 '14 by lpryor



2 Answers

Since pandas 0.13 you can use cumcount (note: DataFrame.sort has since been renamed to sort_values):

In [11]: df[df.sort_values('C').groupby('A').cumcount(ascending=False) >= 2]
Out[11]: 
     A         C         D
0  foo -0.536732  0.061055
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
7  foo -0.640706  2.635910

[4 rows x 3 columns]

It may make more sense to sort first:

In [21]: df = df.sort_values('C')

In [22]: df[df.groupby('A').cumcount(ascending=False) >= 2]
Out[22]: 
     A         C         D
4  foo -0.910537 -1.634047
7  foo -0.640706  2.635910
0  foo -0.536732  0.061055
5  bar -0.346749 -0.127740

[4 rows x 3 columns]
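Wrapping that trick into the `drop_top_n` helper the question asks for might look like the sketch below (`order` is just an illustrative name for the within-group rank from the top; `sort_values` is the modern name for `sort`):

```python
import pandas as pd

def drop_top_n(df, group_column, value_column, number_to_drop):
    # Sort by the value column, then count backwards within each group:
    # a row's cumcount(ascending=False) is 0 for the group's largest value,
    # 1 for the second largest, and so on. Keep rows ranked n or lower.
    order = (df.sort_values(value_column)
               .groupby(group_column)
               .cumcount(ascending=False))
    # Boolean indexing aligns on the original index labels.
    return df[order >= number_to_drop]

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829, -0.910537,
          -0.346749, 0.959957, -0.640706],
    "D": [0.061055, 1.350996, 0.676978, 0.417285, -1.634047,
          -0.127740, -1.068385, 2.635910],
})

result = drop_top_n(df, "A", "C", 2)
print(result)
```

This preserves the original row order, since the mask is aligned back to `df` by index rather than applied to the sorted copy.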
answered Sep 30 '22 by Andy Hayden


You can use the apply() method:

import pandas as pd
import io


txt="""     A         C         D
0  foo -0.536732  0.061055
1  bar  1.470956  1.350996
2  foo  1.981810  0.676978
3  bar -0.072829  0.417285
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
6  foo  0.959957 -1.068385
7  foo -0.640706  2.635910"""

df = pd.read_csv(io.StringIO(txt), sep=r"\s+", index_col=0)

def f(df):
    # Sort each group by C and drop its last (largest) two rows.
    return df.sort_values("C").iloc[:-2]

df2 = df.groupby("A", group_keys=False).apply(f)
print(df2)

output:

     A         C         D
5  bar -0.346749 -0.127740
4  foo -0.910537 -1.634047
7  foo -0.640706  2.635910
0  foo -0.536732  0.061055

If you want original order:

print(df2.reindex(df.index[df.index.isin(df2.index)]))

output:

    A         C         D
0  foo -0.536732  0.061055
4  foo -0.910537 -1.634047
5  bar -0.346749 -0.127740
7  foo -0.640706  2.635910

to get rows above group mean:

def f(df):
    return df[df.C > df.C.mean()]

df3 = df.groupby("A", group_keys=False).apply(f)
print(df3)
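The same pattern covers the percentile case the question mentions (dropping rows above the xth percentile of their group). A sketch using `groupby().transform()`, which broadcasts each group's quantile back onto the original rows (the 0.75 cutoff is just an example value):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "C": [-0.536732, 1.470956, 1.981810, -0.072829, -0.910537,
          -0.346749, 0.959957, -0.640706],
})

# Compute each group's 75th percentile of C, aligned to the original rows,
# and keep only rows at or below their own group's cutoff.
cutoff = df.groupby("A")["C"].transform(lambda s: s.quantile(0.75))
df4 = df[df["C"] <= cutoff]
print(df4)
```

Unlike `apply`, `transform` returns a Series the same length as `df`, so the comparison needs no reindexing and the original row order is kept.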
answered Sep 30 '22 by HYRY