Kernel density estimation in Julia

I am trying to implement a kernel density estimator, but my code does not produce the answer it should. The code is written in Julia, but it should be self-explanatory.

Here is the algorithm:

\$ f(x) = \frac{1}{nh} \sum_{i = 1}^{n} K\left(\frac{x - X_i}{h}\right) \$

where

\$ K(u) = 0.5 \cdot I(|u| \le 1) \$ with \$ u = \frac{x - X_i}{h} \$

So the algorithm tests whether the distance between x and an observation X_i, scaled by some constant factor (the binwidth h), is at most one. If so, the observation contributes 0.5 / (n * h) to the estimate at x, where n is the number of observations.
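
As a quick sanity check of the formula, here is a worked single-kernel evaluation (all numbers are made up purely for illustration):

#one observation X_i = 0.58, evaluation point x = 0.5,
#binwidth h = 0.1, sample size n = 10 (arbitrary values)
x, X_i, h, n = 0.5, 0.58, 0.1, 10

u = (x - X_i) / h            # u = -0.8
K = abs(u) <= 1 ? 0.5 : 0.0  # |u| <= 1, so the kernel fires: K(u) = 0.5

K / (n * h)                  # this observation contributes 0.5 / (10 * 0.1) = 0.5 to f(x)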

Here is my implementation:

#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
#@param data: observations for which the pdf should be estimated
#@return: an array with the estimated densities

function kernelDensity(data)

    #Uniform kernel function.
    #@param x: current x value
    #@param observation: x value of observation i
    #@param width: binwidth
    #@return: 1 if the absolute distance from x (current) to
    #x (observation), scaled by the binwidth, is at most 1;
    #0 otherwise.

    function uniformKernel(x, observation, width)
        u = (x - observation) / width
        abs(u) <= 1 ? 1 : 0
    end

    #number of observations in the data set
    n = length(data)

    #binwidth (set arbitrarily to 0.1)
    h = 0.1

    #vector that stores the pdf
    res = zeros(Real, n)

    #counter variable for the loop
    counter = 0

    #lower and upper limit of the x axis
    start = floor(minimum(data))
    stop = ceil(maximum(data))

    #main loop
    #@linspace: returns n equally spaced points between
    #start and stop
    for x in linspace(start, stop, n)
        counter += 1
        for observation in data

            #add 0.5 for every observation whose kernel fires;
            #the kernel uses the absolute difference, so it
            #picks up observations on either side of x
            res[counter] += 0.5 * uniformKernel(x, observation, h)
        end
        #divide by n times h
        res[counter] /= n * h
    end
    #return results
    res
end
#run function
#@rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))

and this is being returned:

> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0

the sum of which is 8.5 (the cumulative distribution function; it should be 1).

So there are two bugs:

  1. The values are not properly scaled: each number should be around one tenth of its current value. In fact, if the number of observations increases by a factor of 10^k, k = 1, 2, ..., then the cdf also increases by a factor of 10^k.

For example:

> kernelDensity(rand(1000))
> 953.53 
  2. They don't sum up to 10 (or to one, were it not for the scaling error). The error becomes more evident as the sample size increases: approximately 5% of the observations are not being included.

I believe that I implemented the formula one-to-one, so I really don't understand where the error is.

asked Feb 09 '23 by Vincent

2 Answers

I'm not an expert on KDEs, so take all of this with a grain of salt, but a very similar (but much faster!) implementation of your code would be:

function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
    n = length(data)
    res = zeros(T, n)
    lb = minimum(data); ub = maximum(data)
    for (i, x) in enumerate(linspace(lb, ub, n))
        for obs in data
            res[i] += abs((obs - x) / h) <= 1. ? 0.5 : 0.
        end
        res[i] /= (n * h)
    end
    sum(res)
end

If I'm not mistaken, the density estimate should integrate to 1; that is, we would expect kernelDensity(rand(100), 0.1)/100 to get at least close to 1. In the implementation above I'm getting there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwidth (using h = 0.135 instead, I get to within 0.1%), and the uniform kernel is known to be only about 93% "efficient".
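
To make the "integrates to 1" check explicit rather than dividing the sum by 100 (which only works because the grid spacing is roughly 1/n for uniform data on [0, 1]), one can approximate the integral with a Riemann sum over the evaluation grid. A minimal sketch along the lines of the code above:

data = rand(100)
h = 0.135
n = length(data)
lb = minimum(data); ub = maximum(data)
res = zeros(n)
for (i, x) in enumerate(linspace(lb, ub, n))
    for obs in data
        res[i] += abs((obs - x) / h) <= 1. ? 0.5 : 0.
    end
    res[i] /= n * h
end

#approximate the integral of the estimate with a Riemann sum
dx = (ub - lb) / (n - 1)   # grid spacing
sum(res) * dx              # should be close to 1, up to boundary effects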

In any case, there's a very good kernel density package for Julia available here, so you should probably just do Pkg.add("KernelDensity") instead of trying to code your own Epanechnikov kernel :)
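
For reference, a minimal usage sketch of that package (the field names and defaults here are from memory of its interface at the time, so treat this as an illustration and check the package's README):

Pkg.add("KernelDensity")
using KernelDensity

data = rand(1000)
k = kde(data)   # grid, bandwidth and kernel are chosen automatically

#k.x is the evaluation grid and k.density the estimated density;
#a Riemann sum of the density should come out close to 1
dx = k.x[2] - k.x[1]
sum(k.density) * dx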

answered Feb 14 '23 by Nils Gudat

To point out the mistake: you have n bins B_i of size 2h covering [0, 1]. A random point X lands in an expected number of \$ E\left[\sum_{i = 1}^{n} 1\{X \in B_i\}\right] = n \cdot E\left[1\{X \in B_i\}\right] = 2nh \$ bins. You divide each hit by 2nh (the kernel's 0.5 times the factor 1/(nh)).

For n points, the expected value of your function's sum is therefore \$ n \cdot 2nh \cdot \frac{0.5}{nh} = n \$.

Actually, some of your bins have size < 2h (for example, if start = 0, half of the first bin lies outside [0, 1]); factoring this in gives the bias.

Edit: By the way, the bias is easy to calculate if you assume that the bins have random locations in [0, 1]: the bins are then on average missing h/2 = 5% of their size.
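
A quick simulation (a sketch reusing the question's estimator) shows both effects at once: the sum grows linearly in n, and it falls short of n by roughly h/2 = 5% because of the boundary:

h = 0.1
for n in (10, 100, 1000)
    data = rand(n)
    res = zeros(n)
    start = floor(minimum(data)); stop = ceil(maximum(data))   # grid covers [0, 1]
    for (i, x) in enumerate(linspace(start, stop, n))
        for obs in data
            res[i] += abs((obs - x) / h) <= 1. ? 0.5 : 0.
        end
        res[i] /= n * h
    end
    println(n, " -> ", sum(res) / n)   # on average about 1 - h/2 = 0.95
end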

answered Feb 14 '23 by mschauer