
What is sigma clipping? How do you know when to apply it?

I'm reading a book on data science with Python, and the author applies a 'sigma-clipping operation' to remove outliers caused by typos. However, the process isn't explained at all.

What is sigma clipping? Is it only applicable to certain kinds of data (e.g. in the book it's applied to US birth rates)?

As per the text:

quartiles = np.percentile(births['births'], [25, 50, 75]) #so we find the 25th, 50th, and 75th percentiles
mu = quartiles[1] #we set mu = 50th percentile
sig = 0.74 * (quartiles[2] - quartiles[0]) #???

This final line is a robust estimate of the sample mean, where the 0.74 comes 
from the interquartile range of a Gaussian distribution.

Why 0.74? Is there a proof for this?

NRH asked Dec 05 '22 13:12


1 Answer

This final line is a robust estimate of the sample mean, where the 0.74 comes from the interquartile range of a Gaussian distribution.

That's it, really...

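The code estimates sigma, the standard deviation, from the interquartile range, which makes the estimate robust against outliers. (The quoted sentence says "sample mean", but sig measures spread; the robust stand-in for the mean is the median, mu.) The factor 0.74 converts an interquartile range into a standard deviation. Here is how to calculate it: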

from scipy.stats import norm

p1 = norm.ppf(0.25)  # first quartile of the standard normal distribution
p2 = norm.ppf(0.75)  # third quartile
print(p2 - p1)  # 1.3489795003921634

sig = 1  # standard deviation of the standard normal distribution
factor = sig / (p2 - p1)  # sigma divided by the interquartile range
print(factor)  # approximately 0.7413

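In the standard normal distribution sig == 1 and the interquartile range is about 1.35. Any normal distribution is a scaled and shifted copy of the standard one, so its interquartile range is about 1.35 * sig, and sig is about 0.74 * (interquartile range). So 0.74 is the correction factor that turns the interquartile range into sigma. Of course, this only holds exactly for the normal distribution; for other distributions the factor is different.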
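To tie this back to the original question, here is a minimal sketch of what a sigma-clipping step could look like once sigma has been estimated this way. It is not taken from the book: the helper names robust_sigma and sigma_clip, the 5-sigma threshold, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def robust_sigma(values):
    """Estimate sigma from the interquartile range (assumes roughly Gaussian data)."""
    q25, q75 = np.percentile(values, [25, 75])
    return 0.7413 * (q75 - q25)


def sigma_clip(values, n_sigma=5.0):
    """Return a boolean mask that is True where a value lies within
    n_sigma robust standard deviations of the median.

    n_sigma=5.0 is a hypothetical default, not a recommended value.
    """
    values = np.asarray(values, dtype=float)
    mu = np.median(values)       # robust estimate of the center
    sig = robust_sigma(values)   # robust estimate of the spread
    return np.abs(values - mu) < n_sigma * sig


# Synthetic data: Gaussian values plus a few typo-like outliers at the end.
rng = np.random.default_rng(0)
clean = rng.normal(loc=5000.0, scale=300.0, size=10_000)
typos = np.array([1.0, 50.0, 99_999.0])
data = np.concatenate([clean, typos])

mask = sigma_clip(data)
print(np.std(data))        # ordinary standard deviation, inflated by the outliers
print(robust_sigma(data))  # IQR-based estimate, stays close to the true 300
print(data[~mask])         # the values that were clipped
```

The median and the interquartile range barely move when a handful of extreme values are added, which is why they are used here instead of the mean and the ordinary standard deviation. Nothing in the procedure is specific to birth data, but the 0.74 factor does assume that the bulk of the data is approximately normal.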

MB-F answered Dec 16 '22 15:12