
ggplot2 geom_violin with 0 variance

Tags: r, ggplot2

I have started to really like violin plots, since they give me a much better feel than box plots for oddly shaped distributions. I like to automate a lot of my plotting, and thus ran into a problem: when one variable has zero variance, geom_boxplot just draws a line at that value, whereas geom_violin terminates with an error. What behavior would I like? Either draw a line or draw nothing for that variable, but still show the distributions for the other variables.

Ok, quick example:

library(ggplot2)
dff <- data.frame(x = factor(rep(1:2, each = 100)), y = c(rnorm(100), rep(0, 100)))
ggplot(dff, aes(x = x, y = y)) + geom_violin()

yields

Error in `$<-.data.frame`(`*tmp*`, "n", value = 100L) : 
  replacement has 1 row, data has 0

However, what works is:

ggplot(dff,aes(x=x,y=y)) + geom_boxplot()

Update:

The issue is resolved as of yesterday: https://github.com/hadley/ggplot2/issues/972

Update 2: (from question author) Wow, Hadley himself responded! geom_violin now behaves consistently with geom_density and base R density.

However, I don't think the behavior is optimal yet.

(1) The 'zero' problem

Just run it with my original example:

dff <- data.frame(x = factor(rep(1:2, each = 100)), y = c(rnorm(100), rep(0, 100)))
ggplot(dff, aes(x = x, y = y)) + geom_violin(trim = FALSE)

Yielding this (image not shown):

Is the plot on the right an appropriate representation of 'all zeroes'? I don't think so. It would be better if trimming produced a single line, to show that there is no variation in the data. Workaround: add + geom_boxplot() to the plot, which draws a line for the constant group.
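
The shape drawn for the all-zero group can be traced back to the bandwidth rule. The sketch below uses only base R and assumes, as stated above, that geom_violin follows the defaults of base R density(); the variable names are made up for illustration. It shows that the default rule bw.nrd0() does not return a zero bandwidth for a constant vector, so the estimate becomes a smooth bump around 0 rather than a line.

```r
# Base R only: inspect the default bandwidth for a constant vector.
y_const <- rep(0, 100)

# bw.nrd0() is the default bandwidth rule of density().
# With zero spread it falls back to a scale of 1, so the bandwidth is not 0.
bw_const <- bw.nrd0(y_const)

# The kernel density estimate is therefore a bump spread around 0.
d_const <- density(y_const)

bw_const          # bandwidth chosen for the constant data
range(d_const$x)  # evaluation grid extends on both sides of 0
```

Whether a bump of that width is a sensible picture of 'no variation' is exactly the concern raised here.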

(2) I may actually want trim=TRUE.

Example:

dff <- data.frame(x = factor(rep(1:2, each = 100)), y = c(rgamma(100, 1, 1), rep(0, 100)))
ggplot(dff, aes(x = x, y = y)) + geom_violin(trim = FALSE)

Now I have non-negative data, and a standard kernel density estimate does not handle the boundary at zero correctly: it extends the density below zero. With trim=TRUE I can quickly see that the data in the first group is strictly positive.
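
The boundary effect can be checked with base R alone. This is a minimal sketch, not ggplot2's internal code: it uses the cut argument of density() as a rough analogue of trimming, and the variable names are illustrative. With the default cut = 3 the evaluation grid reaches three bandwidths past the data, while cut = 0 restricts it to the observed range.

```r
set.seed(1)
y_pos <- rgamma(100, shape = 1, rate = 1)  # strictly positive draws

d_default <- density(y_pos)           # cut = 3: grid extends past the data
d_cut0    <- density(y_pos, cut = 0)  # grid restricted to range(y_pos)

min(d_default$x)  # below zero, although no observation is negative
min(d_cut0$x)     # equal to min(y_pos)
```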

I am not arguing that the current behavior is 'wrong', since it's in line with other functions. However, geom_violin may be used in different contexts, for exploring different data.frames with heterogeneous data types (positive+skewed or not, for instance).

Asked by Inferrator, Jun 09 '14 21:06


1 Answer

Three options for dealing with this until the ggplot2 issue is resolved:

  1. As a quick hack, you can set one of the y-values to 0.0001 (instead of zero) and geom_violin will work.
  2. Check out the vioplot package if you're not set on using ggplot2. vioplot doesn't throw an error when you feed it a bunch of identical values.
  3. The Hmisc package includes a panel.bpplot (box-percentile plot) function that can create violin plots with the bwplot function from the lattice package. See the Examples section of ?panel.bpplot. It produces a single line when you feed it a vector of identical values.
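
Option 1 can be automated when many variables are plotted in a loop. The following is a rough sketch under stated assumptions: nudge_constant_groups() is a hypothetical helper written for this example (it is not part of ggplot2), and the offset of 1e-4 is simply the value suggested above. It slightly alters the data, so it is only suitable for exploratory plots.

```r
# Hypothetical helper (not part of ggplot2): shift one value in each
# zero-variance group by eps so that a kernel density can be estimated.
nudge_constant_groups <- function(df, group, value, eps = 1e-4) {
  for (g in as.character(unique(df[[group]]))) {
    idx <- which(df[[group]] == g)
    is_constant <- length(idx) > 1 && length(unique(df[[value]][idx])) == 1
    if (is_constant) {
      df[[value]][idx[1]] <- df[[value]][idx[1]] + eps
    }
  }
  df
}

set.seed(1)
dff <- data.frame(x = factor(rep(1:2, each = 100)),
                  y = c(rnorm(100), rep(0, 100)))
dff_nudged <- nudge_constant_groups(dff, "x", "y")

# Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(dff_nudged, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_violin()
}
```

Only the first observation of a constant group is changed; groups that already vary are left untouched.
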
Answered by eipi10, Oct 20 '22 22:10