I am trying to implement a kernel density estimation. However my code does not provide the answer it should. It is also written in julia but the code should be self explanatory.
Here is the algorithm:
where
So the algorithm tests whether the distance between x and an observation X_i weighted by some constant factor (the binwidth) is less then one. If so, it assigns 0.5 / (n * h) to that value, where n = #of observations.
Here is my implementation:
#Kernel density function.
#Purpose: estimate the probability density function (pdf)
#of given observations
#@param data: observations for which the pdf should be estimated
#@return: returns an array with the estimated densities
function kernelDensity(data)
|
| #Uniform kernel function.
| #@param x: Current x value
| #@param X_i: x value of observation i
| #@param width: binwidth
| #@return: Returns 1 if the absolute distance from
| #x(current) to x(observation) weighted by the binwidth
| #is less then 1. Else it returns 0.
|
| function uniformKernel(x, observation, width)
| | u = ( x - observation ) / width
| | abs ( u ) <= 1 ? 1 : 0
| end
|
| #number of observations in the data set
| n = length(data)
|
| #binwidth (set arbitraily to 0.1
| h = 0.1
|
| #vector that stored the pdf
| res = zeros( Real, n )
|
| #counter variable for the loop
| counter = 0
|
| #lower and upper limit of the x axis
| start = floor(minimum(data))
| stop = ceil (maximum(data))
|
| #main loop
| #@linspace: divides the space from start to stop in n
| #equally spaced intervalls
| for x in linspace(start, stop, n)
| | counter += 1
| | for observation in data
| | |
| | | #count all observations for which the kernel
| | | #returns 1 and mult by 0.5 because the
| | | #kernel computed the absolute difference which can be
| | | #either positive or negative
| | | res[counter] += 0.5 * uniformKernel(x, observation, h)
| | end
| | #devide by n times h
| | res[counter] /= n * h
| end
| #return results
| res
end
#run function
#@rand: generates 10 uniform random numbers between 0 and 1
kernelDensity(rand(10))
and this is being returned:
> 0.0
> 1.5
> 2.5
> 1.0
> 1.5
> 1.0
> 0.0
> 0.5
> 0.5
> 0.0
the sum of which is: 8.5 (The cumulative distibution function. Should be 1.)
So there are two bugs:
For example:
> kernelDensity(rand(1000))
> 953.53
I believe that I implemented the formula 1:1, hence I really don't understand where the error is.
I'm not an expert on KDEs, so take all of this with a grain of salt, but a very similar (but much faster!) implementation of your code would be:
function kernelDensity{T<:AbstractFloat}(data::Vector{T}, h::T)
res = similar(data)
lb = minimum(data); ub = maximum(data)
for (i,x) in enumerate(linspace(lb, ub, size(data,1)))
for obs in data
res[i] += abs((obs-x)/h) <= 1. ? 0.5 : 0.
end
res[i] /= (n*h)
end
sum(res)
end
If I'm not mistaken, the density estimate should integrate to 1, that is we would expect kernelDensity(rand(100), 0.1)/100
to get at least close to 1. In the implementation above I'm getting there, give or take 5%, but then again we don't know that 0.1 is the optimal bandwith (using h=0.135
instead I'm getting there to within 0.1%), and the uniform Kernel is known to only be about 93% "efficient".
In any case, there's a very good Kernel Density package in Julia available here, so you probably should just do Pkg.add("KernelDensity")
instead of trying to code your own Epanechnikov kernel :)
To point out the mistake: You have n bins B_i of size 2h covering [0,1], a random point X lands in expected number of bins. You divide by 2 n h.
For n points, the expected value of your function is .
Actually, you have some bins of size < 2h. (for example if start = 0, half of first the bin is outside of [0,1]), factoring this in gives the bias.
Edit: Btw, the bias is easy to calculate if you assume that the bins have random locations in [0,1]. Then the bins are on average missing h/2 = 5% of their size.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With