
How to filter rows that fall within 1st and 3rd quartile of a particular column in pandas dataframe?

I am working with a DataFrame in Python. How can I keep only the rows whose value in a particular column, say val, falls between the 1st and 3rd quartile of that column?

Thank you.

Neel Shah asked Apr 20 '16 04:04




3 Answers

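# Compute the 1st and 3rd quartiles of column B.
low, high = df.B.quantile([0.25, 0.75])
# Keep rows strictly between them; @ refers to the local Python variables.
df.query('@low < B < @high')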
PhilChang answered Sep 30 '22 18:09


Let's create some random data with 100 rows and three columns:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

Now let's use loc to keep only the rows whose value in column B lies strictly between its first and third quartiles, dropping the bottom and top quarters and retaining the middle.

lower_quantile, upper_quantile = df.B.quantile([.25, .75])

>>> df.loc[(df.B > lower_quantile) & (df.B < upper_quantile)].head()
           A         B         C
0   1.764052  0.400157  0.978738
2   0.950088 -0.151357 -0.103219
3   0.410599  0.144044  1.454274
4   0.761038  0.121675  0.443863
10  0.154947  0.378163 -0.887786
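The same loc pattern can be wrapped in a small helper so it works for any column. This is only a sketch: the function name filter_interquartile and the sample frame are made up for illustration, and "val" is simply the column name mentioned in the question.

```python
import pandas as pd


def filter_interquartile(df, column, lower=0.25, upper=0.75):
    """Return the rows whose `column` value lies strictly between two quantiles."""
    # quantile() with a list returns a Series; unpacking yields its two values.
    low, high = df[column].quantile([lower, upper])
    return df.loc[(df[column] > low) & (df[column] < high)]


# Illustrative data (hypothetical); the quartiles of 1..9 are 3.0 and 7.0.
example = pd.DataFrame({"val": [1, 2, 3, 4, 5, 6, 7, 8, 9]})
middle = filter_interquartile(example, "val")
print(middle["val"].tolist())  # [4, 5, 6]
```

Because the comparison is strict, rows sitting exactly on a quartile (3 and 7 here) are dropped; use >= and <= if they should be kept.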
Alexander answered Sep 30 '22 18:09


Using pd.Series.between() and unpacking the quantile values returned by df.A.quantile([0.25, 0.75]), you can filter your DataFrame. Note that between() includes both endpoints by default. Here it is illustrated on random integers in the range 0-99:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'A': np.random.randint(0, 100, 10), 'B': np.arange(10)})

    A  B
0   4  0
1  21  1
2  96  2
3  50  3
4  82  4
5  24  5
6  93  6
7  16  7
8  14  8
9  40  9

df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]


    A  B
1  21  1
3  50  3
5  24  5
9  40  9
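One difference from the other answers is worth spelling out: between() keeps values equal to the quartiles, while the strict > / < comparison drops them. A minimal sketch on a made-up Series whose quartiles coincide with actual data points:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# With the default linear interpolation the quartiles are 2.0 and 4.0.
low, high = s.quantile([0.25, 0.75])

inclusive_rows = s[s.between(low, high)]   # endpoints kept
strict_rows = s[(s > low) & (s < high)]    # endpoints dropped

print(inclusive_rows.tolist())  # [2, 3, 4]
print(strict_rows.tolist())     # [3]
```

On continuous data such as the random normal sample above the two usually agree, but on integer or heavily tied data the choice can change the result noticeably.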

On performance: in the benchmark below, .query() is roughly 2x slower:

df = pd.DataFrame(data={'A': np.random.randint(0, 100, 1000), 'B': np.arange(1000)})

def query(df):
    low, high = df.B.quantile([0.25,0.75])
    df.query('{low}<B<{high}'.format(low=low,high=high))

%timeit query(df)
1000 loops, best of 3: 1.81 ms per loop

def between(df):
    df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]

%timeit between(df)
1000 loops, best of 3: 995 µs per loop

@Alexander's solution performs identically to the one using .between().
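To repeat the comparison outside IPython, something like the following sketch with the standard timeit module could be used. The function names and the bench_df frame are illustrative, all three variants filter on the same column A, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bench_df = pd.DataFrame({"A": rng.integers(0, 100, 1000), "B": np.arange(1000)})


def with_query(df):
    low, high = df.A.quantile([0.25, 0.75])
    return df.query("@low < A < @high")


def with_between(df):
    return df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]


def with_loc(df):
    low, high = df.A.quantile([0.25, 0.75])
    return df.loc[(df.A > low) & (df.A < high)]


# Best of 3 repeats, 100 calls each, reported as seconds per call.
timings = {}
for func in (with_query, with_between, with_loc):
    best = min(timeit.repeat(lambda: func(bench_df), number=100, repeat=3))
    timings[func.__name__] = best / 100

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.0f} µs per call")
```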

Stefan answered Sep 30 '22 17:09