
How do I exclude outliers from an aggregate query?

Tags: sql-server

I'm creating a report comparing total time and volume across units. Here is a simplification of the query I'm using at the moment:

SELECT  m.Unit,
        COUNT(*) AS Count,
        SUM(m.TimeInMinutes) AS TotalTime
FROM    main_table m
WHERE   m.unit <> ''
        AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING  COUNT(*) > 15

However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)

How do I do that?

Margaret asked Jan 17 '11 20:01




2 Answers

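You can exclude the top and bottom x percent of rows with NTILE. With 20 buckets, each bucket holds about 5% of the rows, so keeping buckets 2 to 19 drops the lowest and highest 5%: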

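SELECT  m.Unit,
        COUNT(*) AS Count,
        SUM(m.TimeInMinutes) AS TotalTime
FROM
        (SELECT
             m.Unit,
             m.TimeInMinutes,   -- needed by SUM() in the outer query
             -- split the filtered rows into 20 buckets ordered by time
             NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
         FROM
             main_table m
         WHERE
             m.unit <> '' AND m.TimeInMinutes > 0
        ) m
WHERE
        -- drop bucket 1 (lowest 5%) and bucket 20 (highest 5%)
        Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING  COUNT(*) > 15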

Edit: this article has several techniques too
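
Before trusting the trimmed totals, it may help to see which time values actually land in the two discarded buckets. The query below is a minimal sketch that reuses the table and column names from the question; the aliases b, RowsInBucket, MinTime and MaxTime are made up for illustration.

```sql
-- Show the size and time range of the two buckets that get discarded.
SELECT  b.Buckets,
        COUNT(*)             AS RowsInBucket,
        MIN(b.TimeInMinutes) AS MinTime,
        MAX(b.TimeInMinutes) AS MaxTime
FROM    (SELECT m.TimeInMinutes,
                NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
         FROM   main_table m
         WHERE  m.Unit <> '' AND m.TimeInMinutes > 0
        ) b
WHERE   b.Buckets IN (1, 20)   -- lowest and highest 5% of rows
GROUP BY b.Buckets
ORDER BY b.Buckets;
```

NTILE assigns buckets by row position, so rows with the same TimeInMinutes can end up on either side of a bucket boundary. If that matters for the report, comparing MaxTime of bucket 1 and MinTime of bucket 20 against the neighbouring values is a quick sanity check.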

gbn answered Sep 22 '22 12:09


One way would be to exclude the outliers with a NOT IN clause on the row's ID:

where  m.ID not in 
       (
       select  top 5 percent ID
       from    main_table 
       order by 
               TimeInMinutes desc
       )

Add a second NOT IN clause, ordered by TimeInMinutes ascending, for the bottom five percent.
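
Combined with the query from the question, the two exclusions might look like the sketch below. It assumes ID is a unique, non-null key on main_table (NOT IN returns no rows if the subquery yields a NULL), and it repeats the question's filters inside each subquery so that the five percent is taken from the same rows the report aggregates. That last choice is an assumption about what the report needs, not something stated in the answer.

```sql
SELECT  m.Unit,
        COUNT(*) AS Count,
        SUM(m.TimeInMinutes) AS TotalTime
FROM    main_table m
WHERE   m.Unit <> ''
        AND m.TimeInMinutes > 0
        -- exclude the highest 5% of times
        AND m.ID NOT IN
            (SELECT TOP (5) PERCENT ID
             FROM   main_table
             WHERE  Unit <> '' AND TimeInMinutes > 0
             ORDER BY TimeInMinutes DESC)
        -- exclude the lowest 5% of times
        AND m.ID NOT IN
            (SELECT TOP (5) PERCENT ID
             FROM   main_table
             WHERE  Unit <> '' AND TimeInMinutes > 0
             ORDER BY TimeInMinutes ASC)
GROUP BY m.Unit
HAVING  COUNT(*) > 15;
```

TOP ... PERCENT rounds a fractional row count up to the next whole row, so on small tables this can trim slightly more than 5% from each end.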

Andomar answered Sep 20 '22 12:09