I've just started with R and I've executed these statements:
library(datasets)
head(airquality)
s <- split(airquality,airquality$Month)
sapply(s, function(x) {colMeans(x[,c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)})
lapply(s, function(x) {colMeans(na.omit(x[,c("Ozone", "Solar.R", "Wind")])) })
For the sapply
, it returns the following:
5 6 7 8 9
Ozone 23.61538 29.44444 59.115385 59.961538 31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind 11.62258 10.26667 8.941935 8.793548 10.18000
And for lapply
, it returns the following:
$`5`
Ozone Solar.R Wind
24.12500 182.04167 11.50417
$`6`
Ozone Solar.R Wind
29.44444 184.22222 12.17778
$`7`
Ozone Solar.R Wind
59.115385 216.423077 8.523077
$`8`
Ozone Solar.R Wind
60.00000 173.08696 8.86087
$`9`
Ozone Solar.R Wind
31.44828 168.20690 10.07586
Now, my question would be, why are the returned values similar, but not the same? Isn't na.rm = TRUE
and na.omit
supposed to be doing the exact same thing? Omit the missing values and calculate the mean only for the values that we have? And in that case, shouldn't I have had the same values in both result sets?
Thank you so much for any input!
omit() function in R Language is used to omit all unnecessary cases from data frame, matrix or vector. Syntax: na.omit(data)
na. fail returns the object if it does not contain any missing values, and signals an error otherwise. na. omit returns the object with incomplete cases removed.
You can use the argument na. rm = TRUE to exclude missing values when calculating descriptive statistics in R.
To remove all rows having NA, we can use na. omit function. For Example, if we have a data frame called df that contains some NA values then we can remove all rows that contains at least one NA by using the command na. omit(df).
They are not supposed to give the same result. Consider this example:
exdf<-data.frame(a=c(1,NA,5),b=c(3,2,2))
# a b
#1 1 3
#2 NA 2
#3 5 2
colMeans(exdf,na.rm=TRUE)
# a b
#3.000000 2.333333
colMeans(na.omit(exdf))
# a b
#3.0 2.5
Why is this? In the first case, the mean of column b
is calculated through (3+2+2)/3
. In the second case, the second row is removed in its entirety (also the value of b
which is not-NA and therefore considered in the first case) by na.omit
and so the b
mean is just (3+2)/2
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With