I am working with a data frame in Python. How can I filter all the rows whose value in a particular column, say val, falls between the 1st and 3rd quartile?
Thank you.
Filter rows by condition: you can use df[df["Courses"] == 'Spark'] to filter rows by a condition in a pandas DataFrame. Note that this expression returns a new DataFrame with the selected rows.
Use the drop() method to delete rows based on a column value in a pandas DataFrame. As part of data cleansing, you may need to drop rows when a column value matches a static value or the value of another column.
low, high = df.B.quantile([0.25,0.75])
df.query('{low}<B<{high}'.format(low=low,high=high))
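As a side note on the query() approach above: instead of formatting the numbers into the string, query() can reference local Python variables with an @ prefix. A minimal sketch, assuming a DataFrame with a numeric column B (the random data and the name middle are only for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: 100 rows, three columns of standard normal values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))

# 1st and 3rd quartile of column B.
low, high = df["B"].quantile([0.25, 0.75])

# Inside the query string, '@low' and '@high' refer to the local variables,
# so no string formatting is needed.
middle = df.query("@low < B < @high")
print(middle.head())
```

This keeps the full floating-point precision of the quartile values, which string formatting may round.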
Let's create some random data with 100 rows and three columns:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))
Now let's use loc to drop all rows whose value in column B is below the lower quartile or above the upper quartile (retaining the middle).
lower_quantile, upper_quantile = df.B.quantile([.25, .75])
>>> df.loc[(df.B > lower_quantile) & (df.B < upper_quantile)].head()
A B C
0 1.764052 0.400157 0.978738
2 0.950088 -0.151357 -0.103219
3 0.410599 0.144044 1.454274
4 0.761038 0.121675 0.443863
10 0.154947 0.378163 -0.887786
Using pd.Series.between() and unpacking the quantile values produced by df.A.quantile([lower, upper]), you can filter your DataFrame, here illustrated using sample data ranging 0-100:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'A': np.random.randint(0, 100, 10), 'B': np.arange(10)})
A B
0 4 0
1 21 1
2 96 2
3 50 3
4 82 4
5 24 5
6 93 6
7 16 7
8 14 8
9 40 9
df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]
A B
1 21 1
3 50 3
5 24 5
9 40 9
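One thing to keep in mind: between() includes both endpoints by default, while the loc and query() versions use strict comparisons, so the results can differ when a row sits exactly on a quartile. A small sketch, assuming pandas 1.3 or newer (where inclusive accepts strings) and a made-up column val:

```python
import pandas as pd

# Illustrative data: 0..8, so the quartiles land exactly on 2 and 6.
df = pd.DataFrame({"val": range(9)})
low, high = df["val"].quantile([0.25, 0.75])

# Default behaviour: both endpoints are kept.
inclusive_rows = df[df["val"].between(low, high)]

# inclusive="neither" excludes the endpoints, matching strict < and >.
strict_rows = df[df["val"].between(low, high, inclusive="neither")]

print(inclusive_rows["val"].tolist())  # [2, 3, 4, 5, 6]
print(strict_rows["val"].tolist())     # [3, 4, 5]
```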
On performance: .query() slows things down by about 2x:
df = pd.DataFrame(data={'A': np.random.randint(0, 100, 1000), 'B': np.arange(1000)})
# Note: this version filters on column B; the between() version below filters on column A.
def query(df):
    low, high = df.B.quantile([0.25,0.75])
    df.query('{low}<B<{high}'.format(low=low,high=high))
%timeit query(df)
1000 loops, best of 3: 1.81 ms per loop
def between(df):
    df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]
%timeit between(df)
1000 loops, best of 3: 995 µs per loop
@Alexander's solution performs identically to the one using .between().
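%timeit only works in IPython/Jupyter. If you want to repeat the comparison in a plain script, here is a rough sketch using the standard timeit module. It assumes both variants filter on the same column (B here), and the helper names via_query and via_between are made up for this example. Timings will vary by machine and pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(0, 100, 1000), "B": np.arange(1000)})

def via_query(frame):
    # Strict comparison through query(), using local variables via '@'.
    low, high = frame["B"].quantile([0.25, 0.75])
    return frame.query("@low < B < @high")

def via_between(frame):
    # between() keeps both endpoints by default.
    low, high = frame["B"].quantile([0.25, 0.75])
    return frame[frame["B"].between(low, high)]

# Run each variant 100 times and report the total elapsed seconds.
t_query = timeit.timeit(lambda: via_query(df), number=100)
t_between = timeit.timeit(lambda: via_between(df), number=100)
print(f"query:   {t_query:.4f} s for 100 runs")
print(f"between: {t_between:.4f} s for 100 runs")
```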