About Grubbs' test for outlier detection in R

Tags:

r

outliers

I followed the code from the page "How to repeat the Grubbs test and flag the outliers" and tested for outliers in my data vector, which contains more than 44,000 items.

The output is as follows:

grubbs.result = grubbs.test(test_data)

pvalue = grubbs.result$p.value

grubbs.result

Grubbs test for one outlier
data:  test_data
G = 3.79551464153584561, U = 0.99967764032789053, p-value = 1
alternative hypothesis: highest value -48.70000076 is an outlier

pvalue

[1] 1

grubbs.result$alternative

[1] "highest value -48.70000076 is an outlier"

My question is: why is the p-value 1 when the output says the value -48.70000076 is an outlier? Is -48.70000076 an outlier according to the Grubbs test or not? If it is, how can the p-value be 1 rather than a small value like 0.01?

I am new to this field, so any help would be appreciated. Thank you very much in advance.

hong asked Mar 15 '23 01:03



1 Answer

This is more a question for Cross Validated (CV), but I'll give a quick stats lesson. The most important thing to know when looking for outliers is that unless you have a valid, non-statistical reason, no data point is truly an outlier, no matter how different it is from the rest of the data. Those extreme data points are part of your data; they belong.

Some data really are outliers, but not because Grubbs' test says so. For instance, if you're taking people's temperatures and one person holds the thermometer next to a light bulb, that reading might be considered an outlier. If someone else just drank a glass of cold water before an oral measurement, that reading could also be considered an outlier. But a person who simply has the same temperature as someone who just drank cold water does not necessarily qualify as an outlier. We don't call something an outlier for statistical reasons alone.

That disclaimer aside, we can address the core issue here, and it's statistical. The "alternative hypothesis" line in the output is not a conclusion. It only states which alternative is being tested, namely that the most extreme data point is an outlier. In this case, `-48.70000076` happens to be farther from the sample mean than any other data point, so it is the value named there. The null hypothesis is that no data points are outliers, including that most extreme point. The alternative hypothesis, which we accept only if we reject the null hypothesis, is that at the very least that most extreme point is an outlier (statistically).
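The decision rule described above can be sketched in a few lines of base R. The helper name `flag_extreme` and the 0.05 significance level are assumptions made for illustration; neither comes from the `outliers` package.

```r
# Hypothetical helper (not part of the outliers package): decide whether the
# most extreme point is flagged, given the test's p-value and a chosen
# significance level alpha (0.05 is an assumed, conventional default).
flag_extreme <- function(p_value, alpha = 0.05) {
  # Reject the null hypothesis ("no outliers") only when p_value < alpha.
  p_value < alpha
}

flagged_question <- flag_extreme(1)      # p-value reported in the question
flagged_small    <- flag_extreme(0.001)  # an illustrative small p-value
```

The printed "alternative hypothesis" text plays no part in this rule; only the p-value does.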

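In this case, the p-value of 1 indicates that you have no evidence whatsoever that any of your data are outliers. The same thing happens with simulated data that contain no planted outlier, shifted here so that the highest value is -48.70000076: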

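library(outliers)  # provides grubbs.test()
set.seed(123)
test0 <- runif(1000)  # 1000 uniform draws, no planted outlier
test_data <- test0 - max(test0) - 48.70000076  # shift so the maximum is -48.70000076
grubbs.test(test_data)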

#     Grubbs test for one outlier

# data:  test_data
# G = 1.74660, U = 0.99694, p-value = 1
# alternative hypothesis: highest value -48.70000076 is an outlier
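To see why the p-value can come out as exactly 1, here is a rough base R sketch of the one-sided Grubbs p-value, obtained by inverting the textbook critical-value formula. The function name `grubbs_p_upper` is hypothetical, `n = 44000` is an assumption (the question only says "more than 44000 items"), and I believe, but have not confirmed, that `grubbs.test` computes its p-value in essentially this way.

```r
# Assumed form of the one-sided Grubbs p-value (a Bonferroni-style upper bound):
#   t = sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
#   p = min(1, n * P(T > t)),  with T following a t distribution on n - 2 df
grubbs_p_upper <- function(G, n) {
  t_stat <- sqrt(n * (n - 2) * G^2 / ((n - 1)^2 - n * G^2))
  raw <- n * pt(t_stat, df = n - 2, lower.tail = FALSE)  # bound before capping
  c(raw = raw, p_value = min(1, raw))                    # probabilities cap at 1
}

# G is taken from the question's output; n = 44000 is an assumed sample size.
res_question <- grubbs_p_upper(G = 3.79551464153584561, n = 44000)

# G and n from the simulated example above.
res_simulated <- grubbs_p_upper(G = 1.74660, n = 1000)
```

In both cases the uncapped bound is above 1, so the reported p-value is 1. With tens of thousands of points, a maximum that sits about 3.8 standard deviations above the mean is not surprising at all.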
Sam Dickson answered Mar 16 '23 16:03
