I am working with a DataFrame in Python. How can I filter the rows whose value in a particular column, say val, falls between the 1st and 3rd quartiles? Thank you.


How to filter rows that fall within 1st and 3rd quartile of a particular column in pandas dataframe?

3 Answers

low, high = df.B.quantile([0.25,0.75])
df.query('{low}<B<{high}'.format(low=low,high=high))
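The same filter can also be written without string formatting, because DataFrame.query can reference local Python variables with the @ prefix. A minimal sketch, assuming a small made-up frame with a numeric column B (the data is illustrative only):

```python
import pandas as pd

# Hypothetical sample data; column B holds the values to filter on.
df = pd.DataFrame({"A": list("abcdefgh"), "B": [1, 2, 3, 4, 5, 6, 7, 8]})

# quantile() called with a list returns a Series, which unpacks into two floats.
low, high = df.B.quantile([0.25, 0.75])

# '@name' refers to a local variable inside the query expression.
middle = df.query("@low < B < @high")
print(middle)
```

This avoids converting the floats to text and back, so the bounds are compared at full precision.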

174

answered Sep 30 '22 18:09

PhilChang

Let's create some random data with 100 rows and three columns:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

Now let's use loc to drop the rows whose value in column B lies at or below the first quartile or at or above the third quartile, keeping only the middle 50%.

lower_quantile, upper_quantile = df.B.quantile([.25, .75])

>>> df.loc[(df.B > lower_quantile) & (df.B < upper_quantile)].head()
           A         B         C
0   1.764052  0.400157  0.978738
2   0.950088 -0.151357 -0.103219
3   0.410599  0.144044  1.454274
4   0.761038  0.121675  0.443863
10  0.154947  0.378163 -0.887786
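Since the question names the column as a string (val), the same approach might be wrapped in a small helper. The function name filter_interquartile and the renaming of column B to val are assumptions made for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd


def filter_interquartile(df, column, lower=0.25, upper=0.75):
    """Return the rows whose `column` value lies strictly between two quantiles."""
    low, high = df[column].quantile([lower, upper])
    mask = (df[column] > low) & (df[column] < high)
    return df.loc[mask]


np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 3), columns=list("ABC"))
df = df.rename(columns={"B": "val"})  # hypothetical column name taken from the question

middle = filter_interquartile(df, "val")
print(len(middle), "of", len(df), "rows kept")
```

Accessing the column with df[column] rather than attribute access also works for names that contain spaces or clash with DataFrame methods.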

answered Sep 30 '22 18:09

Alexander

Using pd.Series.between() and unpacking the quantile values produced by df.A.quantile([lower, upper]), you can filter your DataFrame. Here it is illustrated using random integer sample data in the range 0-99:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'A': np.random.randint(0, 100, 10), 'B': np.arange(10)})

    A  B
0   4  0
1  21  1
2  96  2
3  50  3
4  82  4
5  24  5
6  93  6
7  16  7
8  14  8
9  40  9

df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]


    A  B
1  21  1
3  50  3
5  24  5
9  40  9
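One difference from the other answers is worth keeping in mind: between() includes both endpoints by default, so it behaves like >= and <= rather than the strict > and < comparisons. The sketch below uses made-up data chosen so that the quartiles coincide with actual values, which makes the difference visible:

```python
import pandas as pd

# Illustrative data: the quartiles fall exactly on the values 1 and 3.
s = pd.Series([0, 1, 2, 3, 4])
low, high = s.quantile([0.25, 0.75])

# Default between(): both endpoints are kept.
inclusive = s[s.between(low, high)]

# Strict comparisons: values equal to a quartile are dropped.
strict = s[(s > low) & (s < high)]

print(inclusive.tolist())
print(strict.tolist())
```

With continuous data the two variants usually select the same rows; with integer or heavily tied data they can differ.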

On performance: in the benchmark below, .query() is roughly 2x slower:

df = pd.DataFrame(data={'A': np.random.randint(0, 100, 1000), 'B': np.arange(1000)})

def query(df):
    low, high = df.B.quantile([0.25,0.75])
    df.query('{low}<B<{high}'.format(low=low,high=high))

%timeit query(df)
1000 loops, best of 3: 1.81 ms per loop

def between(df):
    df[df.A.between(*df.A.quantile([0.25, 0.75]).tolist())]

%timeit between(df)
1000 loops, best of 3: 995 µs per loop

@Alexander's solution performs identically to the one using .between().
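%timeit is an IPython magic, so outside a notebook a comparable measurement could be sketched with the standard timeit module. The function names below are hypothetical, all three variants filter on the same column with inclusive bounds so the comparison is like for like, and the timings will vary with the machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randint(0, 100, 1000), "B": np.arange(1000)})


def with_query(df):
    low, high = df.A.quantile([0.25, 0.75])
    return df.query("@low <= A <= @high")


def with_between(df):
    low, high = df.A.quantile([0.25, 0.75])
    return df[df.A.between(low, high)]


def with_loc(df):
    low, high = df.A.quantile([0.25, 0.75])
    return df.loc[(df.A >= low) & (df.A <= high)]


runs = 200
for func in (with_query, with_between, with_loc):
    # Total time for `runs` calls, reported as an average per call.
    seconds = timeit.timeit(lambda: func(df), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.0f} us per call")
```

Returning the filtered frame from each function also makes it easy to confirm that the variants select the same rows before comparing their speed.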

answered Sep 30 '22 17:09

Stefan


Tags:

python

pandas

dataframe

python-2.7

Neel Shah
