Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R summary() gives incorrect values for too many NAs

Tags:

r

The Setup

I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary() on it, I get a mean and maximum that are wrong (in disagreement with mean() and max()).

> summary(data, digits = 10)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0       0       1       1       1       1 4365239 

When mean() is called separately, it returns a reasonable value:

> mean(data, na.rm = T)
[1] 0.6804823

Characterization of the problem

It looks like this problem is generic to any vector with more than 3162277 NA values in it.

With just under the cutoff:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.5     0.5     1.0     1.0 3162277 

And just over:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

It doesn't seem to matter how many non-missing values there are either.

> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.2     0.5     0.5     0.8     1.0 3162277 
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

Research

  1. In searching for an answer, I came across the well-known rounding error, but that doesn't affect this behavior (see the first code chunk).
  2. I thought this might be some sort of bizarre quirk of my environment/machine/planetary alignment, so I had my sister run the same code. She got the same results on her machine.

Closing remarks

Clearly, this isn't a critical problem because the mean() and max() functions can be used instead of summary(), but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.

EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.

like image 313
Cathy Avatar asked Oct 30 '22 14:10

Cathy


1 Answers

Here's some example data:

x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000

mean(x, na.rm=TRUE)
#[1] 0.3181818

The issue can be traced back to zapsmall() as it does some rounding in a line that essentially does:

c(out)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06

round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000 

The critical turning point here is 3162277 to 3162278 NA values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.

dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876

dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896

out[7] <- 3162277
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0.0     0.0     0.0     0.3     1.0     1.0 3162277 

out[7] <- 3162278
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#      0       0       0       0       1       1 3162278
like image 190
thelatemail Avatar answered Nov 15 '22 05:11

thelatemail