Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: How to remove outliers from a smoother in ggplot2?

I have the following data set that I am trying to plot with ggplot2, it is a time series of three experiments A1, B1 and C1 and each experiment had three replicates.

I am trying to add a stat which detects and removes outliers before returning a smoother (mean and variance?). I have written my own outlier function (not shown) but I expect there is already a function to do this, I just have not found it.

I've looked at stat_sum_df("median_hilow", geom = "smooth") from some examples in the ggplot2 book, but I didn't understand the help doc from Hmisc to see if it removes outliers or not.

Is there a function to remove outliers like this in ggplot, or where would I amend my code below to add my own function?

EDIT: I just saw this (How to use Outlier Tests in R Code) and notice that Hadley recommends using a robust method such as rlm. I am plotting bacterial growth curves, so I don't think a linear model is best, but any advice on other models or using or using robust models in this situation would be appreciated.

library (ggplot2)  

data = data.frame (day = c(1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7,1,3,5,7), od = 
c(
0.1,1.0,0.5,0.7
,0.13,0.33,0.54,0.76
,0.1,0.35,0.54,0.73
,1.3,1.5,1.75,1.7
,1.3,1.3,1.0,1.6
,1.7,1.6,1.75,1.7
,2.1,2.3,2.5,2.7
,2.5,2.6,2.6,2.8
,2.3,2.5,2.8,3.8), 
series_id = c(
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"A1", "A1", "A1","A1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"B1", "B1","B1", "B1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1",
"C1","C1", "C1", "C1"),
replicate = c(
"A1.1","A1.1","A1.1","A1.1",
"A1.2","A1.2","A1.2","A1.2",
"A1.3","A1.3","A1.3","A1.3",
"B1.1","B1.1","B1.1","B1.1",
"B1.2","B1.2","B1.2","B1.2",
"B1.3","B1.3","B1.3","B1.3",
"C1.1","C1.1","C1.1","C1.1",
"C1.2","C1.2","C1.2","C1.2",
"C1.3","C1.3","C1.3","C1.3"))

> data
   day   od series_id replicate
1    1 0.10        A1      A1.1
2    3 1.00        A1      A1.1
3    5 0.50        A1      A1.1
4    7 0.70        A1      A1.1
5    1 0.13        A1      A1.2
6    3 0.33        A1      A1.2
7    5 0.54        A1      A1.2
8    7 0.76        A1      A1.2
9    1 0.10        A1      A1.3
10   3 0.35        A1      A1.3
11   5 0.54        A1      A1.3
12   7 0.73        A1      A1.3
13   1 1.30        B1      B1.1
... etc...

This is what I have so far and is working nicely, but outliers are not removed:

r <- ggplot(data = data, aes(x = day, y = od))
r + geom_point(aes(group = replicate, color = series_id)) + # add points
   geom_line(aes(group = replicate, color = series_id)) + # add lines
   geom_smooth(aes(group = series_id))  # add smoother, average of each replicate

EDIT: I just added two charts below showing examples of the outlier problems that I'm having from the real data rather than the example data above.

The first plots shows series p26s4 and around day 32 something really weird went on in two of the replicates, showing 2 outliers.

The second plots shows series p22s5 and on day 18, something weird went on with the reading that day, likely machine error I think.

At the moment I am eye-balling the data, to check that the growth curves look OK. After taking Hadley's advice and setting family = "symmetric", I am confident that the loess smoother does a decent job of ignoring the outliers.

p26s4 shows around day 32 something really weird went on in two of the replicates, showing 2 outliers p22s5 shows that on day 18, something weird went on with the reading that day, likely machine error I think

@Peter/@hadley, the next thing I'd like to do is to try and fit a logistic, gompertz or richard's growth curve to this data instead of a loess and calculate the growth rate in the exponential stage. Eventually I plan to use the grofit package in R (http://cran.r-project.org/web/packages/grofit/index.html), but for now I'd like to plot these manually using ggplot2 if possible. If you have any pointers then it would be much appreciated.

like image 402
John Avatar asked Apr 10 '10 06:04

John


2 Answers

Have you tried the family = "symmetric" argument to geom_smooth (which will in turn get passed on to loess)? This will make the loess smooth resistant to outliers.

The syntax would be:

geom_smooth(method = loess, method.args = list(family = "symmetric"))

However, looking at your data, why do you think a linear fit is not adequate? You only have 4 x values, and there certainly doesn't seem to be strong evidence for a departure from linearity.

like image 189
hadley Avatar answered Sep 20 '22 06:09

hadley


First, I'm not sure an 'outlier' is even properly defined on such small data.

Second, you'd then have to decide what you mean by "outlier" that is, is it one of the drugs, one of the replicates, or one of the time points?

As Hadley notes, there is little evidence of deviation from linearity.

Finally, I think part of the point of using a smoother is that it deals well with outliers, provided there is enough data. But you have very little.

So, I have to ask exactly why you want to remove outliers. That is, what are you going to do with these data (besides making nice plots)?

I hope this helps

like image 24
Peter Flom Avatar answered Sep 20 '22 06:09

Peter Flom