Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference in 2D KDE produced using kde2d (R) and ksdensity2d (Matlab)

While trying to port some code from Matlab to R I have run into a problem. The gist of the code is to produce a 2D kernel density estimate and then do some simple calculations using the estimate. In Matlab the KDE calculation was done using the function ksdensity2d.m. In R the KDE calculation is done with kde2d from the MASS package. So lets say I want to calculate the KDE and just add the values (this is not what i intend to do, but it serves this purpose). In R this can be done by

    library(MASS)
    set.seed(1009)
    x <- sample(seq(1000, 2000), 100, replace=TRUE)
    y <- sample(seq(-12, 12), 100, replace=TRUE)
    kk <- kde2d(x, y, h=c(30, 1.5), n=100, lims=c(1000, 2000, -12, 12))
    sum(kk$z)

which gives the answer 0.3932732. When using ksdensity2d in Matlab using the same exact data and conditions the answer is 0.3768. From looking at the code for kde2d I noticed that the bandwidth is divided by 4

    kde2d <- function (x, y, h, n = 25, lims = c(range(x), range(y))) 
    {
    nx <- length(x)
    if (length(y) != nx) 
     stop("data vectors must be the same length")
    if (any(!is.finite(x)) || any(!is.finite(y))) 
     stop("missing or infinite values in the data are not allowed")
    if (any(!is.finite(lims))) 
     stop("only finite values are allowed in 'lims'")
    n <- rep(n, length.out = 2L)
    gx <- seq.int(lims[1L], lims[2L], length.out = n[1L])
    gy <- seq.int(lims[3L], lims[4L], length.out = n[2L])
    h <- if (missing(h)) 
    c(bandwidth.nrd(x), bandwidth.nrd(y))
    else rep(h, length.out = 2L)
    if (any(h <= 0)) 
     stop("bandwidths must be strictly positive")
    h <- h/4
    ax <- outer(gx, x, "-")/h[1L]
    ay <- outer(gy, y, "-")/h[2L]
    z <- tcrossprod(matrix(dnorm(ax), , nx), matrix(dnorm(ay), 
     , nx))/(nx * h[1L] * h[2L])
    list(x = gx, y = gy, z = z)
    }

A simple check to see if the difference in bandwidth is the reason for the difference in the results is then

    kk <- kde2d(x, y, h=c(30, 1.5)*4, n=100, lims=c(1000, 2000, -12, 12))
    sum(kk$z)

which gives 0.3768013 (which is the same as the Matlab answer).

So my question is then: Why does kde2d divide the bandwidth by four? (Or why doesn't ksdensity2d?)

like image 805
mkr Avatar asked Sep 27 '22 17:09

mkr


1 Answers

At the mirrored github source, lines 31-35:

if (any(h <= 0))
    stop("bandwidths must be strictly positive")
h <- h/4                            # for S's bandwidth scale
ax <- outer(gx, x, "-" )/h[1L]
ay <- outer(gy, y, "-" )/h[2L]

and the help file for kde2d(), which suggests looking at the help file for bandwidth. That says:

...which are all scaled to the width argument of density and so give answers four times as large.

But why?

density() says that the width argument exists for the sake of compatibility with S (the precursor to R). The comments in the source for density() read:

## S has width equal to the length of the support of the kernel
## except for the gaussian where it is 4 * sd.
## R has bw a multiple of the sd.

The default is the Gaussian one. When the bw argument is unspecified and width is, width is substituted in, eg.

library(MASS)

set.seed(1)
x <- rnorm(1000, 10, 2)
all.equal(density(x, bw = 1), density(x, width = 4)) # Only the call is different

However, because kde2d() was apparently written to remain compatible with S (and I suppose it was originally written FOR S, given it's in MASS), everything ends up divided by four. After flipping to the relevant section of MASS the book (around p.126), it seems they may have picked four to strike a balance between smoothness and fidelity of data.

In conclusion, my guess is that kde2d() divides by four to remain consistent with the rest of MASS (and other things originally written for S), and that the way you're going about things looks fine.

like image 146
alexforrence Avatar answered Oct 12 '22 23:10

alexforrence