Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Overlay density plot excludes histogram values

I want to overlay a density curve to a frequency histogram I have constructed. For the frequency histogram I used aes(y=..counts../40) because 40 is my total sample number. I used aes(y=..density..*0.1) to force the density to be somewhere between 0 and 1 since my binwidth is 0.1. However, density curve doesn't fit my data and it excludes the values that are equal to 1.0 (notice that the histogram shows accumulation values for the bin=(1.0,1.1) but the density curve ends at 1.0)

this is my data

data<-structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("E1", "test"
), class = "factor"), value = c(0.288888888888889, 0.0817901234567901, 
0.219026548672566, 0.584795321637427, 0.927554980595084, 0.44661095636026, 
1, 0.653780942692438, 1, 0.806451612903226, 1, 0.276794335371741, 
1, 0.930109557990178, 0.776864728192162, 0.824909747292419, 1, 
1, 1, 1, 1, 0.0875912408759124, 0.308065494238933, 1, 0.0258064516129032, 
0.0167322834645669, 1, 1, 0.355605889014723, 0.310344827586207, 
0.106598984771574, 0.364447494852436, 0.174724342663274, 0.77491961414791, 
1, 0.856026785714286, 0.680759275237274, 0.850657108721625, 1, 
1, 0, 0.851851851851852, 1, 0, 0.294954721862872, 0.819870009285051, 
0, 0.734147168531706, 0.0135424091233072, 0.0189098998887653, 
0.0101010101010101, 0, 0.296905222437137, 0.706837929731772, 
0.269279393173198, 0.135379061371841, 0.158969804618117, 0.0902981940361193, 
0.00423131170662906, 0, 0.374880611270296, 0.0425790754257908, 
0.145542753183748, 0, 0.129032258064516, 0.260334645669291, 0, 
0, 1, 0.175505350772889, 0.08248730964467, 0, 0.317217981340119, 
0.614147909967846, 0, 0.264508928571429, 0.883520276100086, 0.0657108721624851, 
0, 0.560229445506692)), row.names = c(NA, -80L), .Names = c("variable", 
"value"), class = "data.frame")

Plot

q<-ggplot(data, aes(value, fill = variable)) 
        q + geom_density(alpha = 0.6,aes(y=..density..*0.1),binwidth=0.1) 
    + theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) 
    + theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) 
    + labs(title="Density Plot GrupoB",x="Respuesta",y="Density")
    +scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))
    +geom_histogram(alpha = 0.6,aes(y=..count../40),binwidth=0.1,position="dodge")

The output I get is this enter image description here

like image 818
Matias Andina Avatar asked May 17 '15 15:05

Matias Andina


People also ask

How do you overlay a density plot on a histogram?

In this method to overlay the histogram with the fitted density curve, the user first needs to install and import the ggplot2 package in the R console, and call the ggplot() function which will create the plot of the given data with the required parameters and add the geom_histogram() to create the histogram of the ...

What does density mean on a histogram?

It is the area of the bar that tells us the frequency in a histogram, not its height. Instead of plotting frequency on the y-axis, we plot the frequency density. To calculate this, you divide the frequency of a group by the width of it.

How do you plot a histogram on a curve in R?

You can plot a histogram in R with the hist function. By default, the function will create a frequency histogram. However, if you set the argument prob to TRUE , you will get a density histogram.


1 Answers

Your plot is doing exactly what is to be expected from your data:

  • You plot data$value, which contains numeric values between 0 and 1, so you should expect the density curve to run from 0 to 1 as well.
  • You plot a histogram with binwidth 0.1. Bins are closed on the lower and open on the upper end. So the binning you get in your case is [0,0.1), [0.1, 0.2), ..., [0.9,1.0), [1.0,1.1). You have 17 values in your data that are 1 and thus go into the last bin, which is plotted from 1 to 1.1.

I think it's a bad idea to plot the histogram the way you do. The reason is that for a histogram, the x-axis is continuous, meaning that the bar that covers the x-axis range from, say, 0.1 to 0.2 stands for the count of values between (and including) 0.1 and 0.2 (not including the latter). Using dodge in this situation leads to a distorted picture, since the bars do now no longer cover the correct x-axis range. Two bars share the range that should be covered in full by both of them. This distortion is one of the reasons why the density curve seems not to match the histogram.

So, what can you do about it? I can give you a few suggestions, but maybe others have better ideas...

  • Instead of plotting the histograms next to each other with position="dodge", you could use faceting, that is, plot the histograms (and corresponding density curves) into separate plots. This can be achieved by adding + facet_grid(variable~.) to your plot.

  • You could cheat a little bit to have the last bin, which is [0.9,1), include 1 (i.e. have it be [0.9,1.0]). Simply replace 1 in your data by 0.999 as follows: data$value[data$value==1]<-0.999. It is important that you do this only for the plot, where it really only means that you slightly redefine the binning. For all the numeric evaluations that you indent to do, you should not do this replacement! (It will, e.g., change the mean of data$value.)

  • Regarding the normalisation of your density curve and the histogram: there is no need for the density curve to lie between 0 and 1. The restriction is that the integral over the density curve should be 1. Thus, to make density curve and histogram compareable, also the histogram should have integral 1, which is achieved, by also dividing the y-value by the bindwidth. So, you should use geom_density(alpha = 0.6,aes(y=..density..)) (I also removed bindwith=0.1 because it has no effect for geom_density) and geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) (no need for position="dodge", once you use faceting). This leads, of course, to exactly the relative normalisation that you had, but it makes more sense because the integrals over density curve and histogram are 1, as they should be.

  • The density curve does still not perfectly match the histogram and this has to do with the way the density estimator is calculated. I don't know this in detail and can thus unfortunately not explain it further. But you can get a better understanding of how it works by playing with the parameter adjust to geom_density. It will make the curve less smooth for smaller numbers and the curve will resemble the histogram more closely.

To put everything together, I have built all my suggestions into your code, used adjust=0.2 in geom_density and plotted the result:

data$value[data$value==1]<-0.999
q<-ggplot(data, aes(value, fill = variable)) 
q + geom_density(alpha = 0.6,aes(y=..density..),adjust=0.2)  +
   theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
   theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
   labs(title="Density Plot GrupoB",x="Respuesta",y="Density")+
   scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))+
   geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) +
   facet_grid(variable~.)

enter image description here

Unfortunately, I can not give you a more complete answer, but I hope these ideas give you a good start.

like image 177
Stibu Avatar answered Sep 24 '22 18:09

Stibu