Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is a histogram for normal samples rougher near the mode than near the tail?

I am trying to understand a particular behavior of the histogram of samples generated from rnorm.

set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)

plot.hist <- function(vec, title, brks) {
  h <- hist(vec, breaks = brks, density = 10,
            col = "lightgray", main = title) 
  xfit <- seq(min(vec), max(vec), length = 40) 
  yfit <- dnorm(xfit, mean = mean(vec), sd = sd(vec)) 
  yfit <- yfit * diff(h$mids[1:2]) * length(vec) 
  return(lines(xfit, yfit, col = "black", lwd = 2))
}

par(mfrow = c(2, 2))
plot.hist(x1, title = 'Sample = 1E3', brks = 100)
plot.hist(x2, title = 'Sample = 1E4', brks = 500)
plot.hist(x3, title = 'Sample = 1E5', brks = 1000)
plot.hist(x4, title = 'Sample = 1E6', brks = 1000)

enter image description here

You will notice that in each case (I am not making cross comparison; I know that as sample size gets larger the match between histogram and the curve is better), the histogram approximates the standard normal better towards the tails, but poorer towards the mode. Simply put, I'm trying to understand why each histogram is rougher in the middle compared to the tails. Is this an expected behavior or have I missed something basic?

like image 447
Ameya Avatar asked Jul 24 '18 22:07

Ameya


2 Answers

This is not just true for normal samples. If we had fixed bins (rather than data-determined ones as we normally do) and we condition on the total number of observations, then the counts would be multinomial.

The expected value of the count in bin i is then n·p(i) where p(i) is the proportion of the population density that falls in bin (i).

The variance of the count in bin i would then be n·p(i)·(1-p(i)). With many bins and a smooth non-peaky density like a normal, (1-p(i)) will be very close to 1; p(i) will typically be smallish (much smaller that 1/2).

variance function for a bernoulli p(1-p) vs p

The variance of the count (and hence the standard deviation of it) is an increasing function of the expected height:

With a fixed bin width the height is proportional to the expected count and the standard deviation of the bin-height is an increasing function of the height.

So this motivates exactly what you see.

In practice it's not the case that the bin boundaries are fixed; as you add observations or generate a new sample they will change, but the number of bins changes fairly slowly as a function of the sample size (typically as the cube root, or sometimes as the log) and a more sophisticated analysis than the one here is required to get the exact form. However, the outcome is the same -- under commonly observed conditions the variance of the height of a bin typically increases monotonically with the height of the bin.

like image 39
Glen_b Avatar answered Sep 20 '22 14:09

Glen_b


Our eyes are fooling us. The density near the mode is high so that we can observe the variation more evidently. The density near the tail is so low so that we can not really spot anything. The following code performs sort of a "standardization", allowing us to visualize the variation on a relative scale.

set.seed(1)
x1 <- rnorm(1000L)
x2 <- rnorm(10000L)
x3 <- rnorm(100000L)
x4 <- rnorm(1000000L)

foo <- function(vec, title, brks) {
  ## bin estimation
  h <- hist(vec, breaks = brks, plot = FALSE)
  ## compute true probability between adjacent break points
  p2 <- pnorm(h$breaks[-1])
  p1 <- pnorm(h$breaks[-length(h$breaks)])
  p <- p2 - p1
  ## compute estimated probability between adjacent break points
  phat <- h$count / length(vec)
  ## compute and plot their absolute relative difference
  v <- abs(phat - p) / p
  ##plot(h$mids, v, main = title)
  ## plotting on log scale is much better!!
  v.log <- log(1 + v)
  plot(h$mids, v.log, main = title)
  ## invisible return
  invisible(list(v = v, v.log = v.log))
  }

par(mfrow = c(2, 2))
v1 <- foo(x1, title = 'Sample = 1E3', brks = 100)
v2 <- foo(x2, title = 'Sample = 1E4', brks = 500)
v3 <- foo(x3, title = 'Sample = 1E5', brks = 1000)
v4 <- foo(x4, title = 'Sample = 1E6', brks = 1000)

enter image description here

The relative variation is the lowest near the middle (toward 0), but very high near the two edges. This is well explained in statistics:

  • We have more samples near the middle, so (sample sd) : (sample mean) there is lower;
  • We have few samples near the edge, maybe 1 or 2, so (sample sd) : (sample mean) there is big.

a little explanation on the log-transform I take

v.log = log(1 + v). Its Taylor expansion ensures that v.log is close to v for very small v around 0. As v gets larger, log(1 + v) gets closer to log(v), thus the usual log-transform is recovered.

like image 162
Zheyuan Li Avatar answered Sep 22 '22 14:09

Zheyuan Li