This is my code:
import numpy as np
from scipy.stats import gaussian_kde
from scipy.stats import norm
from numpy import linspace, hstack
from pylab import plot, show, hist
import re
import json

attribute_file = "path"
attribute_values = [line.rstrip('\n') for line in open(attribute_file)]

obs = []
# Assume the list obs has been loaded at this point
obs = np.asarray(obs)
obs = np.sort(obs, kind='mergesort')
x_min = obs[0]
x_max = obs[-1]

# obtaining the pdf (my_pdf is a function!)
my_pdf = gaussian_kde(obs)

# plotting the result
x = linspace(0, x_max, 1000)
plot(x, my_pdf(x), 'r')               # distribution function
hist(obs, density=True, alpha=.3)     # histogram (density=True; normed is deprecated)
show()

new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))
Problem: the obs array contains the list of all observations. I need to calculate a score (between 0 and 1) for the new values
[-1, 0, 2, 3, 4, 500, 768]
So the value -1 should get a fair (non-negligible) score: it never appears in the observations, but it is right next to the value 1, which is very common in them.
The reason for that is that you have many more 1s in your observations than 768s. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.
Up to a multiplicative constant, the formula for the prediction at a point x is:

f(x) ∝ Σ_{x_i ∈ D} K((x - x_i) / h)

where K is your kernel, D your observations, and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, it is estimated from the data (Scott's rule by default), which here doesn't suit you.
So you can try different values: the larger the bandwidth, the more points far from your new data are taken into account, the limiting case being an almost constant predicted function.
On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.
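For instance, here is a minimal sketch of how passing a scalar bw_method changes the predictions. The obs array below is only a hypothetical stand-in built from the counts mentioned further down (921 ones, 357 twos, 39 values of 768), since the real data isn't shown:

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for the real observations
obs = np.concatenate([np.ones(921), np.full(357, 2.0), np.full(39, 768.0)])
new_values = np.array([-1, 0, 2, 3, 4, 500, 768])

for h in (0.01, 0.1, 1, 5):                  # scalar bandwidth factors to compare
    my_pdf = gaussian_kde(obs, bw_method=h)  # explicit scalar instead of the default estimate
    print(h, my_pdf(new_values))             # density predicted at each new value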
Some graphs to illustrate the influence of the bandwidth:
Code used:
import matplotlib.pyplot as plt

f, axarr = plt.subplots(2, 2, figsize=(10, 10))
for i, h in enumerate([0.01, 0.1, 1, 5]):
    my_pdf = gaussian_kde(obs, h)                            # KDE with an explicit scalar bandwidth factor
    axarr[i // 2, i % 2].plot(x, my_pdf(x), 'r')             # distribution function
    axarr[i // 2, i % 2].set_title("Bandwidth: {0}".format(h))
    axarr[i // 2, i % 2].hist(obs, density=True, alpha=.3)   # histogram
plt.show()
With your current code, for x = -1, the value of K((x - x_i)/h) for all the x_i's that are equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 ones in your observations, and also 357 twos).
On the other hand, for x = 768, the value of the kernel is 1 for all the x_i's that are equal to 768, but there are not many such points (39, to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.
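As a rough numeric illustration of that trade-off (using an unnormalized Gaussian kernel and purely illustrative bandwidths, not the one gaussian_kde actually estimated from your data):

import numpy as np

def gauss(u):
    # Gaussian kernel, up to the multiplicative constant
    return np.exp(-0.5 * u ** 2)

for h in (5.0, 0.5):                       # illustrative bandwidths only
    near = 921 * gauss((-1 - 1) / h)       # 921 obs at 1: many slightly penalized terms at x = -1
    far = 39 * gauss((768 - 768) / h)      # 39 obs at 768: few terms, each at the kernel's maximum
    print(h, near, far)                    # with a large h, near >> far; with a small h, near << far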
If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty K((-1 - 1)/h) paid because of the distance between -1 and 1 will be higher. But I think that this would be overfitting your observations.
A formula to determine whether a new sample is acceptable (compared to your empirical distribution) or not is more of a statistical problem; you can have a look at stats.stackexchange.com.
You can always try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function by dividing it by its maximal value.
After that, all predicted values will be between 0 and 1:
maxDensityValue = np.max(my_pdf(x))   # maximum of the predicted density on the grid
for e in new_values:
    print("{0} {1}".format(e, my_pdf(e) / maxDensityValue))
-1 and 0 are both very close to 1, which occurs very frequently, so they will be predicted to have a higher value. (This is why 0 has a higher value than -1: neither shows up in the observations, but 0 is closer to 1.)
What you need is a smaller bandwidth. Look at the line in your graph to see this: right now, numbers that don't show up at all and are as far away as 80 are getting a lot of value because of their proximity to 1 and 2.
Just pass a scalar as your bw_method to achieve this:
my_pdf = gaussian_kde(obs, 0.1)
This may not be the exact scalar you want, but try changing 0.1 to 0.05 or even less and see what fits what you are looking for.
Also, if you want a value between 0 and 1, you need to make sure that my_pdf() can never return a value over 0.005, because you are multiplying it by 200.
Here is what I mean:
for e in new_values:
    print(str(e) + " - " + str(my_pdf(e) * 100 * 2))
The value you are outputting is:
my_pdf(e) * 100 * 2 == my_pdf(e) * 200
# You want the max value to be 1, so
1 >= my_pdf(e) * 200
# Divide both sides by 200
0.005 >= my_pdf(e)
So my_pdf() needs to have a maximum value of 0.005. Or you can just scale the output.
For the maximum value to be 1 and stay proportional to the input, no matter the input, you would need to first collect the outputs and then scale them based on the largest value.
Example:
orig_val = []   # create an intermediate list
for e in new_values:
    orig_val += [my_pdf(e) * 100 * 2]   # fill with the raw outputs
for i in range(len(new_values)):
    print(str(new_values[i]) + " - " + str(orig_val[i] / max(orig_val)))   # scale based on the largest value
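Putting the pieces together, here is one possible helper that returns scores in [0, 1]. This is only a sketch: the bandwidth factor 0.1 and the 1000-point grid are arbitrary illustrative choices, not prescribed values.

import numpy as np
from scipy.stats import gaussian_kde

def score(obs, new_values, bw=0.1):
    # KDE with a small explicit bandwidth factor, normalized by its maximum on a grid
    kde = gaussian_kde(obs, bw_method=bw)
    grid = np.linspace(np.min(obs), np.max(obs), 1000)
    return kde(new_values) / kde(grid).max()

# usage, assuming obs is the loaded observation array:
# print(score(obs, [-1, 0, 2, 3, 4, 500, 768]))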
Learn more about the gaussian_kde here: scipy.stats.gaussian_kde