Overlay density plot excludes histogram values

Tags:

I want to overlay a density curve to a frequency histogram I have constructed. For the frequency histogram I used aes(y=..counts../40) because 40 is my total sample number. I used aes(y=..density..*0.1) to force the density to be somewhere between 0 and 1 since my binwidth is 0.1. However, density curve doesn't fit my data and it excludes the values that are equal to 1.0 (notice that the histogram shows accumulation values for the bin=(1.0,1.1) but the density curve ends at 1.0)

this is my data

data<-structure(list(variable = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("E1", "test"
), class = "factor"), value = c(0.288888888888889, 0.0817901234567901, 
0.219026548672566, 0.584795321637427, 0.927554980595084, 0.44661095636026, 
1, 0.653780942692438, 1, 0.806451612903226, 1, 0.276794335371741, 
1, 0.930109557990178, 0.776864728192162, 0.824909747292419, 1, 
1, 1, 1, 1, 0.0875912408759124, 0.308065494238933, 1, 0.0258064516129032, 
0.0167322834645669, 1, 1, 0.355605889014723, 0.310344827586207, 
0.106598984771574, 0.364447494852436, 0.174724342663274, 0.77491961414791, 
1, 0.856026785714286, 0.680759275237274, 0.850657108721625, 1, 
1, 0, 0.851851851851852, 1, 0, 0.294954721862872, 0.819870009285051, 
0, 0.734147168531706, 0.0135424091233072, 0.0189098998887653, 
0.0101010101010101, 0, 0.296905222437137, 0.706837929731772, 
0.269279393173198, 0.135379061371841, 0.158969804618117, 0.0902981940361193, 
0.00423131170662906, 0, 0.374880611270296, 0.0425790754257908, 
0.145542753183748, 0, 0.129032258064516, 0.260334645669291, 0, 
0, 1, 0.175505350772889, 0.08248730964467, 0, 0.317217981340119, 
0.614147909967846, 0, 0.264508928571429, 0.883520276100086, 0.0657108721624851, 
0, 0.560229445506692)), row.names = c(NA, -80L), .Names = c("variable", 
"value"), class = "data.frame")

Plot

q<-ggplot(data, aes(value, fill = variable)) 
        q + geom_density(alpha = 0.6,aes(y=..density..*0.1),binwidth=0.1) 
    + theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) 
    + theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) 
    + labs(title="Density Plot GrupoB",x="Respuesta",y="Density")
    +scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))
    +geom_histogram(alpha = 0.6,aes(y=..count../40),binwidth=0.1,position="dodge")

The output I get is this enter image description here

818

asked May 17 '15 15:05

Matias Andina

1 Answers

Your plot is doing exactly what is to be expected from your data:

You plot data$value, which contains numeric values between 0 and 1, so you should expect the density curve to run from 0 to 1 as well.
You plot a histogram with binwidth 0.1. Bins are closed on the lower and open on the upper end. So the binning you get in your case is [0,0.1), [0.1, 0.2), ..., [0.9,1.0), [1.0,1.1). You have 17 values in your data that are 1 and thus go into the last bin, which is plotted from 1 to 1.1.

I think it's a bad idea to plot the histogram the way you do. The reason is that for a histogram, the x-axis is continuous, meaning that the bar that covers the x-axis range from, say, 0.1 to 0.2 stands for the count of values between (and including) 0.1 and 0.2 (not including the latter). Using dodge in this situation leads to a distorted picture, since the bars do now no longer cover the correct x-axis range. Two bars share the range that should be covered in full by both of them. This distortion is one of the reasons why the density curve seems not to match the histogram.

So, what can you do about it? I can give you a few suggestions, but maybe others have better ideas...

Instead of plotting the histograms next to each other with position="dodge", you could use faceting, that is, plot the histograms (and corresponding density curves) into separate plots. This can be achieved by adding + facet_grid(variable~.) to your plot.
You could cheat a little bit to have the last bin, which is [0.9,1), include 1 (i.e. have it be [0.9,1.0]). Simply replace 1 in your data by 0.999 as follows: data$value[data$value==1]<-0.999. It is important that you do this only for the plot, where it really only means that you slightly redefine the binning. For all the numeric evaluations that you indent to do, you should not do this replacement! (It will, e.g., change the mean of data$value.)
Regarding the normalisation of your density curve and the histogram: there is no need for the density curve to lie between 0 and 1. The restriction is that the integral over the density curve should be 1. Thus, to make density curve and histogram compareable, also the histogram should have integral 1, which is achieved, by also dividing the y-value by the bindwidth. So, you should use geom_density(alpha = 0.6,aes(y=..density..)) (I also removed bindwith=0.1 because it has no effect for geom_density) and geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) (no need for position="dodge", once you use faceting). This leads, of course, to exactly the relative normalisation that you had, but it makes more sense because the integrals over density curve and histogram are 1, as they should be.
The density curve does still not perfectly match the histogram and this has to do with the way the density estimator is calculated. I don't know this in detail and can thus unfortunately not explain it further. But you can get a better understanding of how it works by playing with the parameter adjust to geom_density. It will make the curve less smooth for smaller numbers and the curve will resemble the histogram more closely.

To put everything together, I have built all my suggestions into your code, used adjust=0.2 in geom_density and plotted the result:

data$value[data$value==1]<-0.999
q<-ggplot(data, aes(value, fill = variable)) 
q + geom_density(alpha = 0.6,aes(y=..density..),adjust=0.2)  +
   theme_minimal()+scale_fill_manual(values =c("#D7191C","#2B83BA")) +
   theme(legend.position="bottom")+ guides(fill=guide_legend(nrow=1)) +
   labs(title="Density Plot GrupoB",x="Respuesta",y="Density")+
   scale_x_continuous(breaks=seq(from=0,to=1.2,by=0.1))+
   geom_histogram(alpha = 0.6,aes(y=..count../40/.1),binwidth=0.1) +
   facet_grid(variable~.)

enter image description here

Unfortunately, I can not give you a more complete answer, but I hope these ideas give you a good start.

177

answered Sep 24 '22 18:09

Stibu

Related questions
                            
                                Which selector to write in rvest package in R?
                            
                                R data.table replace NA with mean for numeric columns and most frequent value for nominal values
                            
                                Doing absolute descending sort of data.table through function?
                            
                                Efficient calling of F95 in R: use .Fortran or .Call?
                            
                                How to calculate dynamic panel models with lfe package
                            
                                Compiling RMarkdown with RStudio: why reading .RProfile?
                            
                                Count based on multiple conditions from other data.frame
                            
                                how to automatically update a slot of S4 class in R
                            
                                Subset n number of rows from a dataframe, based on a categorical variable, in R
                            
                                Icons as x-axis labels in R
                            
                                fit 2d surface using LOESS in R
                            
                                Building R packages with Packrat and AppVeyor
                            
                                Adding point and lines to 3D scatter plot in R
                            
                                R summarize unique values across columns based on values from one column
                            
                                How can I get a percentile value for each dataframe row considering a subset of the data?
                            
                                Removing Rows Based on Not Enough Repeated Data in a Large Data Set in R
                            
                                R: setting multiple plot heights with par
                            
                                Is there something in R for automatic conversion of column( of data frame or table) into its original vector type
                            
                                Aggregating in R
                            
                                Interpolate values from a grid efficiently in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Overlay density plot excludes histogram values

Tags:

r

ggplot2

frequency

kernel-density

Matias Andina

People also ask

1 Answers

Stibu

Recent Activity

Donate For Us