I'd like to replace all values in my relatively large R dataset that lie above the 95th or below the 5th percentile with those respective percentile values. My aim is to avoid simply cropping these outliers from the data entirely.
Any advice would be much appreciated; I can't find any information on how to do this anywhere else.
Use Mean Detection and Nearest Fill Methods
Fill outliers in the data, where an outlier is defined as a point more than three standard deviations from the mean. Replace the outlier with the nearest element that is not an outlier. In the same graph, plot the original data and the data with the outlier filled.
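A minimal R sketch of the procedure described above, assuming a plain numeric vector with no missing values. The function name fill_nearest and its threshold argument k are made up for illustration, and "nearest" is taken here to mean nearest by position in the vector, which is an assumption.

```r
# Hypothetical helper: replace points more than k standard deviations
# from the mean with the closest (by index) non-outlier value.
fill_nearest <- function(x, k = 3) {
  is_out <- abs(x - mean(x)) > k * sd(x)
  good <- which(!is_out)          # positions of non-outliers
  for (i in which(is_out)) {
    # which.min() returns the first minimum, so ties go to the left neighbour
    x[i] <- x[good[which.min(abs(good - i))]]
  }
  x
}

# Made-up example data with one extreme value at position 11
x <- c(rep(10, 10), 1000, rep(12, 10))
filled <- fill_nearest(x)
```

To compare the two series in one graph, something like plot(x, type = "l") followed by lines(filled, col = "red") should work in base R.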
One can identify all "outliers" at once and replace all of them with the mean of the remainder. This is a consistent procedure not unlike Winsorizing. You argue against replacing outliers with a value that is dependent on the other values in the data.
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal.
You can do it in one line of code using squish():
d2 <- squish(d, quantile(d, c(.05, .95)))
In the scales library, look at ?squish and ?discard
#--------------------------------
library(scales)
pr <- .95
q <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------
# Note: depending on your needs, you may want to round off the quantiles, e.g.:
q <- round(quantile(d, c(1-pr, pr)))
Example:
d <- 1:20
d
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1] 2 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19
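If adding a dependency on scales is not desirable, the same clamping can be sketched in base R with pmin() and pmax(). This is an alternative to the squish() call above, not a description of what squish() does internally; the variable names simply mirror the example.

```r
d <- 1:20
q <- round(quantile(d, c(.05, .95)))   # 2 and 19 for this d

# pmin() caps values at the upper bound, pmax() raises them to the lower bound;
# [[ ]] drops the percentile names from q
d2 <- pmax(pmin(d, q[[2]]), q[[1]])
d2
```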
This would do it.
fun <- function(x){
  # 5th and 95th percentiles of x
  quantiles <- quantile( x, c(.05, .95 ) )
  # pull values outside that range in to the nearest percentile
  x[ x < quantiles[1] ] <- quantiles[1]
  x[ x > quantiles[2] ] <- quantiles[2]
  x
}
fun( yourdata )
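For a dataset with several columns, and possibly missing values, the same idea can be applied column by column. The sketch below is one way to do that under stated assumptions: winsorize is a hypothetical variant of fun that passes na.rm = TRUE to quantile() (which otherwise stops with an error when it meets NA values), and df is made-up example data.

```r
# Hypothetical variant of fun() that tolerates NA values
winsorize <- function(x, probs = c(.05, .95)) {
  q <- quantile(x, probs, na.rm = TRUE)
  x[!is.na(x) & x < q[[1]]] <- q[[1]]
  x[!is.na(x) & x > q[[2]]] <- q[[2]]
  x
}

# Made-up example data: two numeric columns and one character column
df <- data.frame(a = c(1:19, 100), b = c(NA, -50, 3:20), g = letters[1:20])

# Apply only to the numeric columns, leaving the others untouched
num <- vapply(df, is.numeric, logical(1))
df[num] <- lapply(df[num], winsorize)
```

Missing values stay missing, and non-numeric columns such as g pass through unchanged.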