
How to use pandas filter with IQR

Is there a built-in way to filter a column by IQR (i.e. keep values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR)? Any other suggestions for generalized filtering in pandas would also be appreciated.

Qijun Liu asked Jan 14 '16 05:01



2 Answers

As far as I know, the most compact notation is provided by the query method.

    import numpy as np
    import pandas as pd

    # Some test data
    np.random.seed(33454)
    df = pd.concat(
        [
            # Bulk of the data: 20 random integers in [0, 100)
            pd.DataFrame({'nb': np.random.randint(0, 100, 20)}),
            # Adding some outliers: 2 random integers in [100, 200)
            pd.DataFrame({'nb': np.random.randint(100, 200, 2)}),
        ],
        # Resetting the index (pd.concat replaces DataFrame.append, removed in pandas 2.0)
        ignore_index=True,
    )

    # Computing IQR
    Q1 = df['nb'].quantile(0.25)
    Q3 = df['nb'].quantile(0.75)
    IQR = Q3 - Q1

    # Filtering values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
    filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

Then we can plot the result to check the difference. We observe that the outlier in the left boxplot (the cross at 183) does not appear anymore in the filtered series.

    # Plotting the result to check the difference
    df.join(filtered, rsuffix='_filtered').boxplot()

Comparison before and after filtering

Since writing this answer, I've written a post on this topic where you may find more information.
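
For the "generalized filtering" part of the question, the same fences can be applied to several columns at once with a boolean mask. The following is a minimal sketch, not a built-in pandas feature: the helper name iqr_mask, the multiplier parameter k, and the sample data are all assumptions made for illustration.

```python
import pandas as pd


def iqr_mask(df, columns, k=1.5):
    """Return a boolean Series that is True for rows inside the IQR fences.

    A row passes only if, for every column in `columns`, its value lies in
    [Q1 - k * IQR, Q3 + k * IQR], with quartiles computed per column.
    (Hypothetical helper, not part of pandas.)
    """
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # ge/le with axis='columns' match each per-column bound to its column.
    inside = df[columns].ge(lower, axis='columns') & df[columns].le(upper, axis='columns')
    return inside.all(axis=1)


# Made-up sample data: 100 is an outlier in 'a', -50 is an outlier in 'b'.
data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    'b': [10, 11, 12, 13, 14, 15, 16, 17, -50, 19],
})
filtered_rows = data[iqr_mask(data, ['a', 'b'])]
```

With this sample data, the rows holding 100 and -50 are dropped and the other eight rows are kept. Returning a mask rather than a filtered frame leaves it to the caller to combine it with other conditions.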

Romain answered Sep 20 '22 09:09

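Another approach uses Series.between(). Note that this example keeps only the values between Q1 and Q3 (the middle 50% of the data), not the wider Q1 - 1.5 * IQR to Q3 + 1.5 * IQR range from the question: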

    # inclusive='both' needs pandas >= 1.3 (older versions used inclusive=True)
    iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive='both')]

Drawn out:

    # Select the first quartile
    q1 = df['col'].quantile(.25)

    # Select the third quartile
    q3 = df['col'].quantile(.75)

    # Create a mask for values between q1 and q3 (bounds included)
    mask = df['col'].between(q1, q3, inclusive='both')

    # Filter the initial dataframe with the mask
    iqr = df.loc[mask, 'col']
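
If the goal is to drop outliers using the 1.5 * IQR fences from the question, rather than to keep only the values between the quartiles, between() can take the widened bounds instead. A minimal sketch, assuming made-up sample data in a column named 'col'; the names iqr_width and no_outliers are chosen here for illustration:

```python
import pandas as pd

# Hypothetical sample data; 95 is the intended outlier.
df = pd.DataFrame({'col': [10, 12, 13, 15, 16, 18, 19, 21, 22, 95]})

q1 = df['col'].quantile(0.25)
q3 = df['col'].quantile(0.75)
iqr_width = q3 - q1

# Keep values within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
# between() includes both bounds by default, so `inclusive` is left out.
mask = df['col'].between(q1 - 1.5 * iqr_width, q3 + 1.5 * iqr_width)
no_outliers = df.loc[mask, 'col']
```

Here 95 falls above the upper fence and is removed, while the other nine values are kept.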

MLenthousiast answered Sep 21 '22 09:09