
How to winsorize (or remove univariate outliers) in a longitudinal dataset

Tags: r

I am trying to figure out how to winsorize observations grouped by individuals in a longitudinal dataset.

I started off with this excellent answer about how to remove data >2 standard deviations from the mean of a variable. The author also helpfully shows how to do this within categories.

My use case is slightly different: I have a longitudinal dataset, and I want to remove individuals who are, over time, systematically shown to be outliers. Rather than taking out the extreme observations within subjects, I'd like to either exclude those individuals altogether (trimming the data) or replace the bottom and top 2.5% with the cut value (winsorizing, see: http://en.wikipedia.org/wiki/Winsorising).

For example, my long-form data might look like this:

name time points
MJ   1    998
MJ   2    1000
MJ   3    998
MJ   4    3000
MJ   5    998
MJ   5    420
MJ   6    999
MJ   7    998
Lebron   1    9
Lebron   2    1
Lebron   3    3
Lebron   4    900
Lebron   5    4
Lebron   5    4
Lebron   6    3
Lebron   7    8
Kobe   1    2
Kobe   2    1
Kobe   3    4
Kobe   4    2
Kobe   5    1000
Kobe   5    4
Kobe   6    7
Kobe   7    9
Larry   1    2
Larry   2    1
Larry   3    4
Larry   4    2
Larry   5    800
Larry   5    4
Larry   6    7
Larry   7    9
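
For anyone who wants to reproduce this, the table above can be rebuilt in base R. This is only a minimal sketch: it assumes the object is called df (the name used in the code in this question) and that the columns are name, time and points as shown.

```r
# Rebuild the example data as a data frame named `df`
# (the object name is an assumption taken from the code in the question).
df <- data.frame(
  name = rep(c("MJ", "Lebron", "Kobe", "Larry"), each = 8),
  time = rep(c(1, 2, 3, 4, 5, 5, 6, 7), times = 4),  # time 5 appears twice, as in the table
  points = c(998, 1000, 998, 3000, 998, 420, 999, 998,  # MJ
             9, 1, 3, 900, 4, 4, 3, 8,                  # Lebron
             2, 1, 4, 2, 1000, 4, 7, 9,                 # Kobe
             2, 1, 4, 2, 800, 4, 7, 9)                  # Larry
)

# Per-individual means, to see how far each person sits from the others
person_means <- aggregate(points ~ name, data = df, FUN = mean)
print(person_means)
```

The per-individual means are what matter for my use case, since the goal is to flag whole individuals rather than single observations.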

If I wanted to remove the extreme observations in points within individuals (name), my code would be:

do.call(rbind,by(df,df$name,function(x) x[!abs(scale(x$points)) > 2,]))

But what I really want to do is exclude the INDIVIDUAL who is extreme (in this case, MJ). How would I go about doing that?

(P.S. - insert here all of the caveats about how one should not remove outliers. This is just a robustness test!)

roody asked Feb 21 '14 23:02


1 Answer

I would just use dplyr:

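library(dplyr)

# Assumes test.csv holds the long-form data (columns: name, time, points)
test <- read.csv("test.csv", header = TRUE)

# Attach each individual's mean score to every one of their rows.
# %>% is used because the old %.% operator is no longer available in dplyr.
test <- test %>%
  group_by(name) %>%
  mutate(mean_points = mean(points))

# 2.5th and 97.5th percentiles of the mean_points column (one value per row)
cut_point_top <- quantile(test$mean_points, 0.975)
cut_point_bottom <- quantile(test$mean_points, 0.025)

# Flag individuals at or beyond either cut point, then drop them
test <- test %>%
  group_by(name) %>%
  mutate(outlier_top = mean_points >= cut_point_top,
         outlier_bottom = mean_points <= cut_point_bottom) %>%
  filter(!outlier_top & !outlier_bottom)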

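This filters out MJ as having a mean score in the top 2.5% and Larry as being in the bottom 2.5%. With only four individuals, the highest and lowest means coincide with the cut points, so MJ and Larry are flagged by the >= and <= comparisons.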

If you want to replace the points variable with the cut points at the 2.5th and 97.5th percentiles instead of dropping anyone, run the last chain without the final filter statement, like so:

# Same flags as before, without the final filter()
test <- test %>%
  group_by(name) %>%
  mutate(outlier_top = mean_points >= cut_point_top,
         outlier_bottom = mean_points <= cut_point_bottom)

# Overwrite points with the relevant cut point for flagged individuals
test$points <- ifelse(test$outlier_top, cut_point_top,
                      ifelse(test$outlier_bottom, cut_point_bottom, test$points))
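
If you would rather winsorize the individual-level means directly, clamping with pmin/pmax is one way to do it. This is only a sketch, not the approach above: winsorize is a hypothetical helper name, the quantiles are taken over one mean per individual rather than over rows, and the four means are the per-individual averages of the example data in the question.

```r
# Hypothetical helper: clamp values to the given lower/upper quantiles.
winsorize <- function(x, probs = c(0.025, 0.975)) {
  cuts <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, cuts[1]), cuts[2])
}

# Per-individual mean of points, computed from the example data in the question
means <- c(MJ = 1176.375, Lebron = 116.5, Kobe = 128.625, Larry = 103.625)

# Extreme means are pulled in to the cut points; the others are unchanged
w <- winsorize(means)
print(w)
```

Note that this changes one summary value per person. To push the result back onto the long-form rows you would still need to join it by name.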
user2987808 answered Oct 13 '22 12:10