I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary()
on it, I get a mean and maximum that are wrong (in disagreement with mean()
and max()
).
> summary(data, digits = 10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 1 1 1 1 4365239
When mean()
is called separately, it returns a reasonable value:
> mean(data, na.rm = T)
[1] 0.6804823
It looks like this problem is generic to any vector with more than 3162277 NA values in it.
With just under the cutoff:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.5 0.5 1.0 1.0 3162277
And just over:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
It doesn't seem to matter how many non-missing values there are either.
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.2 0.5 0.5 0.8 1.0 3162277
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
Clearly, this isn't a critical problem because the mean()
and max()
functions can be used instead of summary()
, but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.
EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.
Here's some example data:
x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
mean(x, na.rm=TRUE)
#[1] 0.3181818
The issue can be traced back to zapsmall()
as it does some rounding in a line that essentially does:
c(out)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06
round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
The critical turning point here is 3162277
to 3162278
NA
values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.
dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876
dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896
out[7] <- 3162277
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.0 0.0 0.0 0.3 1.0 1.0 3162277
out[7] <- 3162278
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 3162278
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With