Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Error using geom_density_2d() in R : Computation failed in `stat_density2d()`: bandwidths must be strictly positive

Tags:

r

ggplot2

In a attempt to make a test 2d density plot with ggplot2, I used the code snippet:

ggplot(df, aes(x = S1.x, y = S1.y)) + geom_point() + geom_density_2d()

and I got the error: "Computation failed in stat_density2d(): bandwidths must be strictly positive"

My dataframe looks like this:

> df

transcriptID S1.x      S1.y      S2.x       S2.y    
DQ459412     0.000000  0.000000  0.000000   0.000000
DQ459413     1.584963  2.358379  4.392317   3.085722    
DQ459415     0.000000  0.000000  0.000000   0.000000    
DQ459418     0.000000  0.000000  0.000000   0.000000    
DQ459419     0.000000  0.000000  4.000000   2.891544    
DQ459420     0.000000  0.000000  0.000000   0.000000      

Also, var(df[,"S1.x"]) > 0 and var(df[,"S1.y"]) > 0.

Fig 1 - 2d density plot with error

However, I got a density plot without error by running:

ggplot(df, aes(x = S2.x, y = S2.y)) + geom_point() + geom_density_2d()

Fig 2 - density plot without error

How do I address the error in Fig 1?

like image 911
Onyi Ukay Avatar asked Oct 31 '18 02:10

Onyi Ukay


2 Answers

So the real problem is that the S1.x and S1.y values only have one non-zero value in their columns. And it turns out that geom_density_2d can't really estimate a density with only a value or two. But read on...

Update:

This question has been asked before, and the answers are usually that you need to have non-zero variance in your data columns. But you do have non-zero variance, so why isn't it working?

  • Looking at the internals of geom_density_2d we see that it uses the MASS::kde2d package function to calculate the distribution.
  • Looking at kde2d we see that it uses MASS::bandwidth.nrd(df$x) to get an estimate of the bandwidth.
  • Looking at the help (which has the code) for bandwidth.nrd we see it uses a rule of thumb that gets the quantile of the distribution, and subtracts the 2nd quantile from the 1st quantile to get a bandwidth estimate.
  • Doing a quantile on your original data we see that the quantiles of the data were zero.
  • And running MASS::kde2d on your original data with that bandwidth.nrd estimate of the bandwidth gives you the same error:
library(MASS)
nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0,0)
s1y <- c(0,2.358379,0,0,0,0) 
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0) 
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
> quantile(df$s1x)
      0%      25%      50%      75%     100% 
0.000000 0.000000 0.000000 0.000000 1.584963 
> quantile(df$s1y)
      0%      25%      50%      75%     100% 
0.000000 0.000000 0.000000 0.000000 2.358379 
h <- c(MASS::bandwidth.nrd(df$x), MASS::bandwidth.nrd(df$y))
dens <- MASS::kde2d(df$s1x, df$s1y, h = h, n = n,  lims = c(0,1,0,1))

Error in MASS::kde2d(df$s1x, df$s1y, h = h, n = n, lims = c(0, 1, 0, 1)) : bandwidths must be strictly positive

So the real criteria for using geom_density_2D is that both the x- and the y-data needs to have a non-zero gap between their 1st and 2nd quantiles.

Now to fix it, if I make a small modification - replacing one of the zeros with 0.1, like this:

nn <- c("DQ459412","DQ459413","DQ459415","DQ459418","DQ459419","DQ459420")
s1x <- c(0,1.584963,0,0,0.1,0)
s1y <- c(0,2.358379,0,0,0.1,0) 
s2x <- c(0,4.392317,0,0,4,0)
s2y <- c(0,3.085722,0,0,2.891544,0) 
df <- data.frame(transcriptID=nn,S1.x=s1x,S1.y=s1y,S2.x=s2x,S2.y=s2y)
print(df)

yielding:

  transcriptID     S1.x     S1.y     S2.x     S2.y
1     DQ459412 0.000000 0.000000 0.000000 0.000000
2     DQ459413 1.584963 2.358379 4.392317 3.085722
3     DQ459415 0.000000 0.000000 0.000000 0.000000
4     DQ459418 0.000000 0.000000 0.000000 0.000000
5     DQ459419 0.100000 0.100000 4.000000 2.891544
6     DQ459420 0.000000 0.000000 0.000000 0.000000

Then I get this plot instead of your error.

enter image description here You can let that 0.1 value approach zero, eventually it will not be able to calculate a distribution anymore and you will get your error again.

One general way to deal with this situation is to add a very small quantity of noise to your data, kind of simulating the fact that any meaningful calculation based on a real measurement from a continuous distribution should be impervious to that small quantity of noise.

Hope that helps.

like image 113
Mike Wise Avatar answered Sep 21 '22 06:09

Mike Wise


The answer of @Mike Wise is pretty solid indeed and my answer is somewhat complementary to it. Actually, the bandwidth.nrd function computes the difference between the 3rd and the 1st quantile not the 2nd and 1st (code from the function):

r <- quantile(distances, c(0.25, 0.75))

Instead of adding random noise to your data, I would suggest to precompute the bandwidths yourself and pass them to the function, testing for non-zero values like so:

kde2d(df$s1x, df$s1y, 
      h = c(ifelse(bandwidth.nrd(df$s1x) == 0, 0.1, bandwidth.nrd(df$s1x)),
            ifelse(bandwidth.nrd(df$s1y) == 0, 0.1, bandwidth.nrd(df$s1y))))

Hope this helps.

like image 29
Marius Avatar answered Sep 18 '22 06:09

Marius