
Understanding bandwidth smoothing in ggplot2

Tags: r, ggplot2

realdata = https://www.dropbox.com/s/pc5tp2lfhafgaiy/realdata.txt

simulation = https://www.dropbox.com/s/5ep95808xg7bon3/simulation.txt

A density plot of this data using bandwidth=1.5 gives me the following plot:

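# read both samples (one numeric value per entry)
realdata = scan("realdata.txt")
simulation = scan("simulation.txt")
# base graphics: bw sets the kernel bandwidth directly
plot(density(log10(realdata), bw=1.5))
lines(density(log10(simulation), bw=1.5), lty=2)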

enter image description here

But when I plot the same data with ggplot2, the bandwidth argument (adjust) seems to work differently. Why?

vec1 = data.frame(x=log10(realdata))
vec2 = data.frame(x=log10(simulation))
library(ggplot2)
ggplot() +
    geom_density(aes(x=x, linetype="real data"), data=vec1, adjust=1.5) +
    geom_density(aes(x=x, linetype="simulation"), data=vec2, adjust=1.5) +
    scale_linetype_manual(name="data",
        values=c("real data"="solid", "simulation"="dashed"))

enter image description here

Suggestions on how to better smooth this data are also very welcome!

asked Jul 27 '14 20:07 by vitor



1 Answer

adjust= is not the same as bw=. When you plot

plot(density(log10(realdata), adjust=1.5))
lines(density(log10(simulation), adjust=1.5), lty=2)

you get the same thing as ggplot

enter image description here

For whatever reason, ggplot (at least in the version used here) does not allow you to specify a bw= parameter. By default, density() uses bw.nrd0() to choose the bandwidth, so while you overrode this in the base graphics plot, you cannot override it in ggplot. What actually gets used is adjust*bw. So, since we know how to calculate the default bw, we can recalculate adjust= to give us the same value.
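The adjust*bw relationship can be checked directly in base R. The following is a minimal sketch, not part of the original answer: x is simulated data standing in for log10(realdata), and the names d_bw, d_adj and d_fix are made up for illustration. It relies on density() returning the bandwidth it actually used in its bw component.

```r
# Simulated stand-in for log10(realdata); the original files are not used here.
set.seed(42)
x <- rnorm(500, mean = 2, sd = 0.8)

# bw= sets the kernel bandwidth directly.
d_bw <- density(x, bw = 1.5)

# adjust= multiplies the default bandwidth, which is bw.nrd0(x).
d_adj <- density(x, adjust = 1.5)

# Dividing the target bandwidth by bw.nrd0(x) gives the adjust value
# whose effective bandwidth is (up to rounding) 1.5 again.
d_fix <- density(x, adjust = 1.5 / bw.nrd0(x))

# Effective bandwidths used by each call.
print(c(bw = d_bw$bw, adjust = d_adj$bw, rescaled = d_fix$bw))
```

The first and third values should agree, while the second depends on the spread and size of the sample, which is why the same adjust= gives different smoothing for different data sets.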

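# helper function: convert a target bandwidth b into the adjust= value
# that yields it, given that the effective bandwidth is adjust * bw.nrd0(x)
bw <- function(b, x) { b/bw.nrd0(x) }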

library(ggplot2)
ggplot() +
    geom_density(aes(x=x, linetype="real data"), data=vec1, adjust=bw(1.5, vec1$x)) +
    geom_density(aes(x=x, linetype="simulation"), data=vec2, adjust=bw(1.5, vec2$x)) +
    scale_linetype_manual(name="data",
        values=c("real data"="solid", "simulation"="dashed"))

And that results in

enter image description here

which is the same as the base graphics plot.
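More recent ggplot2 releases accept a bw argument in geom_density(), which makes the rescaling unnecessary. The sketch below assumes that your installed ggplot2 supports this argument. It uses simulated data in a hypothetical data frame vec rather than the original files, and compares the two approaches through layer_data().

```r
library(ggplot2)

# Simulated stand-in for one of the log10-transformed samples.
set.seed(1)
vec <- data.frame(x = rnorm(200, mean = 3, sd = 1))

# Direct bandwidth (assumes a ggplot2 version whose geom_density() has bw=).
p_bw <- ggplot(vec, aes(x = x)) + geom_density(bw = 1.5)

# Rescaled adjust=, the workaround described in this answer.
p_adj <- ggplot(vec, aes(x = x)) +
    geom_density(adjust = 1.5 / bw.nrd0(vec$x))

# Extract the computed density curves and compare them numerically.
d_bw <- layer_data(p_bw)
d_adj <- layer_data(p_adj)
same_curve <- isTRUE(all.equal(d_bw$density, d_adj$density))
print(same_curve)
```

If the bw argument is not available in your version, the adjust= workaround above remains the way to match a fixed base-graphics bandwidth.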

answered Oct 02 '22 09:10 by MrFlick