I want to add a density line (a normal density actually) to a histogram.
Suppose I have the following data. I can plot the histogram by ggplot2:
set.seed(123)    
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))
ggplot(df, aes(x = x)) + geom_histogram(colour = "black", fill = "white", 
                                        binwidth = 0.01) 

I can add a density line using:
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y = ..density..),colour = "black", fill = "white", 
                 binwidth = 0.01) + 
  stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))

But this is not what I actually want, I want this density line to be fitted to the count data.
I found a similar post (HERE) that offered a solution to this problem. But it did not work in my case. I need to an arbitrary expansion factor to get what I want. And this is not generalizable at all:
ef <- 100 # Expansion factor
ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01) + 
  stat_function(fun = function(x, mean, sd, n){ 
    n * dnorm(x = x, mean = mean, sd = sd)}, 
    args = list(mean = mean(df$x), sd = sd(df$x), n = ef))

Any clues that I can use to generalize this
A basic histogram can be created with the hist function. In order to add a normal curve or the density line you will need to create a density histogram setting prob = TRUE as argument.
In order to add a density curve over a histogram you can use the lines function for plotting the curve and density for calculating the underlying non-parametric (kernel) density of the distribution. The bandwidth selection for adjusting non-parametric densities is an area of intense research.
To create a density plot in R you can plot the object created with the R density function, that will plot a density curve in a new R window. You can also overlay the density curve over an R histogram with the lines function. The result is the empirical density function.
Description. As known as Kernel Density Plots, Density Trace Graph. A Density Plot visualises the distribution of data over a continuous interval or time period. This chart is a variation of a Histogram that uses kernel smoothing to plot values, allowing for smoother distributions by smoothing out the noise.
Fitting a distribution function does not happen by magic. You have to do it explicitly. One way is using fitdistr(...) in the MASS package.
library(MASS)    # for fitsidtr(...)
# excellent fit (of course...)
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dbeta,args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

# horrible fit - no surprise here
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dnorm,args=fitdistr(df$x,"normal")$estimate)

# mediocre fit - also not surprising...
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dgamma,args=fitdistr(df$x,"gamma")$estimate)

EDIT: Response to OP's comment.
The scale factor is binwidth ✕ sample size.
ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=function(x,shape1,shape2)0.01*nrow(df)*dbeta(x,shape1,shape2),
                args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With