I'm creating a report comparing total time and volume across units. Here a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences.
Removing Outliers using Standard Deviation. Another way we can remove outliers is by calculating upper boundary and lower boundary by taking 3 standard deviation from the mean of the values (assuming the data is Normally/Gaussian distributed).
Some outliers represent natural variations in the population, and they should be left as is in your dataset. These are called true outliers. Other outliers are problematic and should be removed because they represent measurement errors, data entry or processing errors, or poor sampling.
In this method, we completely remove data points that are outliers. Consider the 'Age' variable, which had a minimum value of 0 and a maximum value of 200. The first line of code below creates an index for all the data points where the age takes these two values.
You can exclude the top and bottom x percentiles with NTILE
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM
(SELECT
m.Unit,
NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
FROM
main_table m
WHERE
m.unit <> '' AND m.TimeInMinutes > 0
) m
WHERE
Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
Edit: this article has several techniques too
One way would be to exclude the outliers with a not in
clause:
where m.ID not in
(
select top 5 percent ID
from main_table
order by
TimeInMinutes desc
)
And another not in
clause for the bottom five percent.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With