
Python Gaussian kernel density: calculate score for new values

This is my code:

import numpy as np
from scipy.stats import gaussian_kde
from numpy import linspace
from pylab import plot, show, hist

attribute_file = "path"

attribute_values = [line.rstrip('\n') for line in open(attribute_file)]

obs = []

# Assume the list obs is loaded here

obs = np.asarray(obs)
obs = np.sort(obs, kind='mergesort')
x_min = obs[0]
x_max = obs[-1]

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(obs)

# plotting the result
x = linspace(0, x_max, 1000)

plot(x, my_pdf(x), 'r')  # distribution function

hist(obs, density=True, alpha=.3)  # histogram
show()

new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))

Problem: The obs array contains the list of all observations. I need to calculate a score (between 0 and 1) for new values

[-1, 0, 2, 3, 4, 500, 768]

So the value -1 should still get a decent score: it doesn't appear in the distribution, but it is next to the value 1, which is very common in the observations.

asked Aug 19 '15 by Usi Usi

2 Answers

The reason is that you have many more 1s in your observations than 768s. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.

Up to a multiplicative constant, the formula for the prediction is:

    f(x) ∝ Σ_{x_i ∈ D} K((x − x_i) / h)

where K is your kernel, D your observations and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, it is estimated in a way which here doesn't suit you.

So you can try some different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.

On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.
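To make the two limit cases concrete, here is a small sketch on synthetic observations (the real data isn't shown in the question, so the counts below are invented for illustration; a scalar bw_method in SciPy is a multiplier of the data's standard deviation):

```python
import numpy as np
from scipy.stats import gaussian_kde

# toy observations: a frequent value (1) and a rare distant one (768)
obs = np.concatenate([np.ones(100), np.full(5, 768.0)])

x = np.array([-1.0, 1.0, 768.0])
wide = gaussian_kde(obs, bw_method=5.0)      # large factor: nearly flat prediction
narrow = gaussian_kde(obs, bw_method=0.001)  # small factor: sharp peaks at the data

print(np.round(wide(x) / wide(x).max(), 3))    # all three values are of similar size
print(np.round(narrow(x) / narrow(x).max(), 3))  # -1 gets essentially nothing
```

With the large factor the predicted function is almost constant over the whole range; with the small one, only points sitting almost exactly on an observation get any mass.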

Some graphs to illustrate the influence of the bandwidth:

Code used:

import matplotlib.pyplot as plt
f, axarr = plt.subplots(2, 2, figsize=(10, 10))
for i, h in enumerate([0.01, 0.1, 1, 5]):
    my_pdf = gaussian_kde(obs, h)
    axarr[i//2, i%2].plot(x, my_pdf(x), 'r')  # distribution function
    axarr[i//2, i%2].set_title("Bandwidth: {0}".format(h))
    axarr[i//2, i%2].hist(obs, density=True, alpha=.3)  # histogram

With your current code, for x = -1, the value of K((x - x_i)/h) for all the x_i's that are equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 1s in your observations, and also 357 2s).

On the other hand, for x = 768, the value of the kernel is 1 for all the x_i's that are equal to 768, but there are not many such points (39, to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.

If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty (K(-2)) paid because of the distance between -1 and 1 will be higher. But I think that this would be overfitting your observations.
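That "many small terms vs. few large terms" effect can be checked by hand. This sketch uses the counts mentioned above (921 ones, 357 twos, 39 copies of 768); the bandwidth h = 1 is an assumption for illustration:

```python
import numpy as np

def gauss(u):
    # standard Gaussian kernel K(u)
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

h = 1.0  # assumed bandwidth, for illustration only

# contribution of each group of observations to the KDE sum at x = -1:
# 921 points at distance 2, 357 points at distance 3
sum_at_minus1 = 921 * gauss((-1 - 1) / h) + 357 * gauss((-1 - 2) / h)
# contribution at x = 768: 39 points at distance 0
sum_at_768 = 39 * gauss((768 - 768) / h)

print(sum_at_minus1)  # ≈ 51.3: many small terms
print(sum_at_768)     # ≈ 15.6: few maximal terms
```

Even though each individual kernel value at x = -1 is small, the 921 of them together outweigh the 39 maximal terms at x = 768, which is exactly why -1 gets a higher prediction.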

A formula to determine whether a new sample is acceptable (compared to your empirical distribution) is more of a statistical problem; you can have a look at stats.stackexchange.com.

You can always try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function, dividing it by its maximal value.

After that, all predicted values will be between 0 and 1:

maxDensityValue = np.max(my_pdf(x))
for e in new_values:
    print("{0} {1}".format(e, my_pdf(e)/maxDensityValue))
answered Nov 15 '22 by P. Camilleri


-1 and 0 are both very close to 1, which occurs very frequently, so they will be predicted to have a higher value. (This is why 0 gets a higher value than -1: neither shows up in the observations, but 0 is closer to 1.)

What you need is a smaller bandwidth. Look at the line in your graph to see this: right now numbers as far away as 80, which don't show up at all, are getting a lot of value because of their proximity to 1 and 2.
Just pass a scalar as bw_method to achieve this:

my_pdf = gaussian_kde(obs, 0.1)

This may not be the exact scalar you want, but try changing 0.1 to 0.05 or even less and see what fits what you are looking for.
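A quick way to do that tuning is to loop over candidate scalars and compare the (max-scaled) scores. Note that in SciPy a scalar bw_method is used directly as kde.factor, a multiplier of the data's standard deviation, so the effective smoothing depends on the spread of your data. The observations below are a synthetic stand-in, since the real ones are not shown in the question:

```python
import numpy as np
from scipy.stats import gaussian_kde

# synthetic stand-in for the observations: mostly 1s and 2s, a few 768s
obs = np.concatenate([np.ones(921), np.full(357, 2.0), np.full(39, 768.0)])

new_values = [-1, 0, 2, 3, 4, 500, 768]
for factor in [0.1, 0.05, 0.01]:
    kde = gaussian_kde(obs, bw_method=factor)
    scores = kde(new_values) / kde(new_values).max()  # scale into [0, 1]
    print(factor, np.round(scores, 3))
```

Smaller factors make the scores for values far from any observation (like 500) drop toward 0, while values sitting on or near a frequent observation stay near 1.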

Also if you want a value between 0 and 1 you need to make sure that my_pdf() can never return a value over .005 because you are multiplying it by 200.
Here is what I mean:

for e in new_values:
    print (str(e)+" - "+str(my_pdf(e)*100*2))

The value you are outputting is:

my_pdf(e)*100*2 == my_pdf(e)*200
#You want the max value to be 1, so
1 >= my_pdf(e)*200
#Divide both sides by 200
0.005 >= my_pdf(e)

So my_pdf() needs to have a max value of 0.005, or you can just scale the data.

For the max value to be 1 and stay proportionate to the input, no matter the input, you would need to first collect the output and then scale it based on the largest value.
Example:

orig_val = []  # create intermediate list

for e in new_values:
    orig_val.append(my_pdf(e) * 100 * 2)  # fill with the data

for i in range(len(new_values)):
    print(str(new_values[i]) + " - " + str(orig_val[i] / max(orig_val)))  # scale based on largest value

Learn more about the gaussian_kde here: scipy.stats.gaussian_kde

answered Nov 15 '22 by ThatGuyRussell