I've got some data on sales, say, and want to look at how different postcodes compare: do some deliver more profitable business than others? So I'm grouping by postcode, and can easily get various stats out on a per-postcode basis. However, there are a few very high-value jobs which distort the stats, so I'd like to ignore the outliers. For various reasons, I'd like to define the outliers by group: so, for example, drop the rows in the dataframe that are in the top xth percentile of their group, or the top n in their group.
So if I've got the following data frame:
>>> df
Out[67]:
A C D
0 foo -0.536732 0.061055
1 bar 1.470956 1.350996
2 foo 1.981810 0.676978
3 bar -0.072829 0.417285
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
6 foo 0.959957 -1.068385
7 foo -0.640706 2.635910
I'd like to have some function, say drop_top_n(df, group_column, value_column, number_to_drop),
where drop_top_n(df, "A", "C", 2)
would return
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
Using filter drops whole groups, rather than parts of groups.
I could iterate through the groups, I suppose, and for each group find out which rows to drop, and then go back to the original dataframe and drop them, but that seems terribly clumsy. Is there a better way?
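(For the percentile variant mentioned at the top, a minimal sketch, assuming a recent pandas and a hypothetical helper name drop_top_percentile: compute each group's cutoff with transform("quantile") and keep the rows below it.)

import pandas as pd

def drop_top_percentile(df, group_column, value_column, q):
    # Per-group cutoff: the q-th quantile of value_column within each
    # group, broadcast back onto the rows by transform().
    cutoff = df.groupby(group_column)[value_column].transform("quantile", q)
    # Keep only rows strictly below their group's cutoff.
    return df[df[value_column] < cutoff]

# e.g. drop everything in the top 10% of its group:
# drop_top_percentile(df, "A", "C", 0.9)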
Since pandas 0.13 you can use cumcount:
In [11]: df[df.sort_values('C').groupby('A').cumcount(ascending=False) >= 2]  # add .sort_index() to the mask to silence the reindex UserWarning
Out[11]:
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
[4 rows x 3 columns]
It may make more sense to sort first:
In [21]: df = df.sort_values('C')
In [22]: df[df.groupby('A').cumcount(ascending=False) >= 2]
Out[22]:
A C D
4 foo -0.910537 -1.634047
7 foo -0.640706 2.635910
0 foo -0.536732 0.061055
5 bar -0.346749 -0.127740
[4 rows x 3 columns]
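For what it's worth, a minimal sketch wrapping this cumcount trick into the drop_top_n helper the question asks for (the function name comes from the question; the body assumes a modern pandas, where sort is spelled sort_values):

def drop_top_n(df, group_column, value_column, number_to_drop):
    # Rank rows within each group from the top: cumcount() after a
    # descending sort gives 0 to the group's largest value.
    rank_from_top = (df.sort_values(value_column, ascending=False)
                       .groupby(group_column)
                       .cumcount())
    # Keep rows ranked number_to_drop or lower; sort_index() realigns
    # the mask with df and silences the reindex UserWarning.
    return df[rank_from_top.sort_index() >= number_to_drop]

drop_top_n(df, "A", "C", 2) then reproduces the desired output shown in the question.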
You can use the apply() method:
import pandas as pd
import io
txt=""" A C D
0 foo -0.536732 0.061055
1 bar 1.470956 1.350996
2 foo 1.981810 0.676978
3 bar -0.072829 0.417285
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
6 foo 0.959957 -1.068385
7 foo -0.640706 2.635910"""
df = pd.read_csv(io.StringIO(txt), sep=r"\s+", index_col=0)
def f(df):
    return df.sort_values("C").iloc[:-2]
df2 = df.groupby("A", group_keys=False).apply(f)
print(df2)
output:
A C D
5 bar -0.346749 -0.127740
4 foo -0.910537 -1.634047
7 foo -0.640706 2.635910
0 foo -0.536732 0.061055
If you want original order:
print(df2.reindex(df.index[df.index.isin(df2.index)]))
output:
A C D
0 foo -0.536732 0.061055
4 foo -0.910537 -1.634047
5 bar -0.346749 -0.127740
7 foo -0.640706 2.635910
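(As an aside, not from the original answer: since the original index here is an ordered range, df2.sort_index() restores the original order just as well.)

print(df2.sort_index())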
To get rows above the group mean:
def f(df):
    return df[df.C > df.C.mean()]
df3 = df.groupby("A", group_keys=False).apply(f)
print(df3)
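As a design note: for a simple per-group threshold like this, apply() can be avoided entirely. A sketch of the same filter using transform, which broadcasts each group's mean back onto the rows and boolean-indexes in one vectorised step:

# Equivalent to f above, but without apply(): compare each row's C
# against its group's mean, computed via transform().
df3 = df[df["C"] > df.groupby("A")["C"].transform("mean")]
print(df3)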