
Calculate how a value differs from the average of values using the Gaussian Kernel Density (Python)

I use this code to generate the values on which I want to compute a Gaussian kernel density estimate:

from random import randint
x_grid=[]
for i in range(1000):
    x_grid.append(randint(0,4))
print (x_grid)

This is the code that computes the Gaussian kernel density:

from statsmodels.nonparametric.kde import KDEUnivariate
import matplotlib.pyplot as plt

def kde_statsmodels_u(x, x_grid, bandwidth=0.2, **kwargs):
    """Univariate Kernel Density Estimation with Statsmodels"""
    kde = KDEUnivariate(x)
    kde.fit(bw=bandwidth, **kwargs)
    return kde.evaluate(x_grid)

import numpy as np
from scipy.stats.distributions import norm

# The grid we'll use for plotting (1000 random integers in [0, 4], as above)
from random import randint
x_grid = [randint(0, 4) for _ in range(1000)]
print(x_grid)

# Draw points from a bimodal distribution in 1D
np.random.seed(0)
x = np.concatenate([norm(-1, 1.).rvs(400),
                    norm(1, 0.3).rvs(100)])

pdf_true = (0.8 * norm(-1, 1).pdf(x_grid) +
            0.2 * norm(1, 0.3).pdf(x_grid))

# Plot the kernel density estimate against the true density
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(13, 8))
fig.subplots_adjust(wspace=0)

pdf=kde_statsmodels_u(x, x_grid, bandwidth=0.2)
ax[0].plot(x_grid, pdf, color='blue', alpha=0.5, lw=3)
ax[0].fill(x_grid, pdf_true, ec='gray', fc='gray', alpha=0.4)
ax[0].set_title("kde_statsmodels_u")
ax[0].set_xlim(-4.5, 3.5)

plt.show()

All the values in the grid are between 0 and 4. If I receive a new value, say 5, I want to measure how much it differs from the typical values and assign it a score between 0 and 1, so that I can apply a threshold.

So a new value of 5 should get a score close to 0.90, while a new value of 500 should get a score close to 0.0.

How can I do that? Is my function for computing the Gaussian kernel density correct, or is there a better way or library for this?

* UPDATE * I read an example in a paper. A washing machine typically weighs about 100 kg. Vendors usually also use kg as the unit for its capacity (for example, 9 kg). For a human it is easy to understand that 9 kg is the capacity and not the total weight of the washing machine. We can "fake" this form of intelligence without deep language understanding by instead modeling a distribution of values over the training data for each attribute.

For a given attribute a (the weight of a washing machine, for example), let Va = {va1, va2, ..., van} (|Va| = n) be the set of values of attribute a corresponding to products in the training data. Intuitively, if a new value v is "close" to (the distribution estimated from) Va, then we should feel more confident assigning this value to a (for example, the weight of a washing machine).

One idea is to measure the number of standard deviations by which the new value v differs from the mean of the values in Va. A better one may be to model a (Gaussian) kernel density on Va and then express the support at the new value v as the density at that point:

[Image: formula for the support S(c.s.v, Va)]

where σ²ak is the variance of the kth Gaussian, and Z is a constant that ensures S(c.s.v, Va) ∈ [0, 1]. How can I obtain this in Python using the statsmodels library?
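
As a rough illustration of the support score described above, the sketch below averages unnormalized Gaussian kernels centered on each training value. It makes two assumptions that are not in the paper excerpt: a single shared bandwidth instead of a per-kernel variance, and Z = n (the number of training values), which keeps the score in [0, 1] because each kernel is at most 1. The function name `support` and the sample data are hypothetical, and plain NumPy is used rather than statsmodels so that the arithmetic stays visible.

```python
import numpy as np

def support(v, values, bandwidth):
    """Mean of unnormalized Gaussian kernels centered on each training value.

    Assumption: one shared bandwidth and Z = len(values), so the result
    always lies in [0, 1].
    """
    values = np.asarray(values, dtype=float)
    v = np.atleast_1d(np.asarray(v, dtype=float))
    # Standardized distance from every query point to every training value.
    z = (v[:, None] - values[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).mean(axis=1)

# Hypothetical training data: RAM sizes of 1, 2 and 3 GB in equal numbers.
ram_gb = np.array([1, 2, 3] * 100)
scores = support([2, 5, 500], ram_gb, bandwidth=1.0)
print(scores)
```

The bandwidth decides how quickly the score decays away from the training values, so any threshold has to be chosen together with it.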

* UPDATE 2 * Example data (although I don't think the exact values matter much), generated by this code:

from random import randint
x_grid=[]
for i in range(1000):
    x_grid.append(randint(1,3))
print (x_grid)

[2, 2, 1, 2, 2, 3, 1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 1, 2, 1, 3, 2, 3, 3, 1, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 2, 3, 2, 1, 2, 3, 3, 2, 2, 3, 3, 2, 2, 1, 2, 1, 2, 2, 3, 3, 1, 1, 2, 3, 3, 2, 1, 2, 3, 3, 3, 3, 2, 1, 3, 2, 2, 1, 3, 3, 1, 2, 1, 3, 2, 3, 3, 1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 1, 2, 1, 1, 2, 3, 2, 1, 2, 2, 2, 3, 2, 3, 3, 1, 1, 3, 2, 1, 1, 3, 3, 3, 2, 1, 2, 2, 1, 3, 2, 3, 1, 3, 1, 2, 3, 1, 3, 2, 2, 1, 1, 2, 2, 3, 1, 1, 3, 2, 2, 1, 2, 1, 2, 3, 1, 3, 3, 1, 2, 1, 2, 1, 3, 1, 3, 3, 2, 1, 1, 3, 2, 2, 2, 3, 2, 1, 3, 2, 1, 1, 3, 3, 3, 2, 1, 1, 3, 2, 1, 2, 2, 2, 1, 3, 1, 3, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 3, 1, 1, 2, 2, 1, 1, 1, 3, 3, 3, 3, 1, 3, 1, 3, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 3, 1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 1, 1, 2, 3, 1, 1, 1, 3, 1, 3, 2, 2, 3, 1, 3, 3, 2, 2, 3, 2, 1, 2, 1, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 2, 1, 3, 3, 3, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 3, 1, 3, 3, 2, 3, 2, 1, 3, 3, 3, 3, 3, 1, 2, 2, 2, 1, 1, 3, 2, 3, 1, 2, 3, 2, 3, 2, 1, 1, 3, 3, 1, 1, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 1, 1, 2, 3, 2, 3, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 3, 3, 2, 2, 3, 3, 3, 2, 1, 1, 2, 2, 3, 2, 3, 2, 1, 1, 1, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 1, 3, 2, 1, 3, 3, 2, 1, 1, 3, 1, 3, 1, 2, 2, 1, 3, 3, 2, 3, 1, 1, 3, 1, 2, 2, 1, 3, 2, 3, 1, 1, 3, 1, 3, 1, 2, 1, 3, 2, 2, 2, 2, 1, 3, 2, 1, 3, 3, 2, 3, 2, 1, 3, 1, 2, 1, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 1, 1, 3, 2, 3, 2, 2, 2, 3, 1, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 3, 3, 1, 3, 2, 3, 1, 1, 2, 1, 3, 1, 2, 2, 3, 3, 1, 3, 1, 1, 2, 2, 1, 3, 3, 3, 1, 2, 2, 2, 1, 3, 1, 2, 2, 2, 3, 3, 3, 1, 1, 2, 3, 3, 1, 1, 2, 3, 2, 3, 3, 2, 2, 1, 3, 3, 3, 3, 2, 3, 1, 3, 3, 2, 1, 3, 2, 1, 1, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 3, 3, 3, 2, 1, 3, 1, 1, 1, 1, 3, 1, 2, 3, 3, 3, 2, 3, 1, 2, 2, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 3, 3, 3, 2, 3, 3, 2, 1, 1, 1, 2, 3, 1, 3, 3, 2, 1, 3, 3, 3, 2, 2, 1, 2, 3, 2, 3, 3, 3, 3, 2, 3, 2, 1, 2, 1, 1, 3, 3, 3, 2, 2, 
3, 1, 3, 2, 1, 3, 1, 1, 3, 3, 1, 2, 2, 2, 3, 3, 1, 2, 1, 2, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 3, 2, 2, 1, 3, 1, 1, 3, 2, 1, 2, 3, 2, 1, 3, 3, 3, 2, 3, 1, 2, 3, 3, 1, 2, 2, 2, 3, 1, 2, 1, 1, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 2, 3, 3, 1, 2, 1, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3, 1, 1, 3, 3, 1, 3, 1, 2, 1, 2, 1, 3, 2, 2, 1, 3, 1, 3, 3, 1, 3, 1, 1, 1, 1, 3, 2, 1, 2, 3, 1, 1, 3, 1, 1, 3, 1, 3, 3, 3, 1, 1, 3, 1, 3, 2, 2, 2, 1, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 2, 2, 3, 1, 2, 2, 2, 1, 1, 3, 1, 2, 2, 2, 1, 1, 2, 3, 1, 3, 1, 1, 3, 2, 2, 3, 2, 2, 3, 3, 1, 1, 2, 2, 3, 1, 1, 2, 3, 2, 2, 3, 1, 2, 2, 1, 1, 3, 2, 3, 1, 1, 3, 1, 3, 2, 3, 3, 3, 3, 3, 2, 2, 3, 2, 1, 1, 1, 3, 3, 1, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, 3, 1, 1, 1, 1, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 3, 3, 3, 1, 1, 2, 2, 3, 2, 3, 2, 2, 1, 2, 1, 2, 1, 2, 2, 3, 1, 1, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, 1, 2, 2, 1, 1, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 3, 2, 2, 2, 1, 1, 2, 1, 3, 1, 1, 1, 2, 3, 3, 2, 3, 1, 3]

This array represents the RAM of new smartphones on the market. They usually have 1, 2, or 3 GB of RAM.

This is the kernel density:

[Image: plot of the kernel density estimate]

*** UPDATE

I tried the code with these values:

[1024, 1, 1024, 1000, 1024, 128, 1536, 16, 192, 2048, 2000, 2048, 24, 250, 256, 278, 288, 290, 3072, 3, 3000, 3072, 32, 384, 4096, 4, 4096, 448, 45, 512, 576, 64, 768, 8, 96]

The values are all in MB. Do you think it is working well? I think I need to set a threshold.

      100%      cdfv      kdev
1       42  0.210097  0.499734
1024    96  0.479597  0.499983
5000     0  0.000359  0.498885
2048    36  0.181609  0.499700
3048     8  0.040299  0.499424

* UPDATE 3 *

[256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 256, 256, 256, 512, 512, 512, 128, 128, 128, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 128, 128, 128, 512, 512, 512, 256, 256, 256, 256, 256, 256, 1024, 1024, 1024, 512, 512, 512, 128, 128, 128, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 4, 4, 4, 3, 3, 3, 24, 24, 24, 8, 8, 8, 16, 16, 16, 16, 16, 16, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 2048, 2048, 2048, 2048, 2048, 2048, 4096, 4096, 4096, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 768, 768, 768, 768, 768, 768, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 
2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 256, 256, 256, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 128, 128, 128, 576, 576, 576, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 576, 576, 576, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 
1024, 1024, 2048, 2048, 2048, 512, 512, 512, 2048, 2048, 2048, 768, 768, 768, 768, 768, 768, 768, 768, 768, 512, 512, 512, 192, 192, 192, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 384, 384, 384, 448, 448, 448, 576, 576, 576, 384, 384, 384, 288, 288, 288, 768, 768, 768, 384, 384, 384, 288, 288, 288, 64, 64, 64, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 64, 64, 64, 128, 128, 128, 128, 128, 128, 128, 128, 128, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 256, 256, 256, 768, 768, 768, 768, 768, 768, 768, 768, 768, 256, 256, 256, 192, 192, 192, 256, 256, 256, 64, 64, 64, 256, 256, 256, 192, 192, 192, 128, 128, 128, 256, 256, 256, 192, 192, 192, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 128, 128, 128, 128, 128, 128, 384, 384, 384, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 32, 32, 32, 768, 768, 768, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 128, 128, 128, 128, 128, 128, 1024, 1024, 1024, 1024, 1024, 1024, 128, 128, 128, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 1024, 1024, 1024, 1024, 
1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048, 384, 384, 384, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 128, 128, 128, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 768, 768, 768, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 128, 128, 128, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 
1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 64, 64, 64, 64, 64, 64, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 16, 16, 16, 3072, 3072, 3072, 3072, 3072, 3072, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 32, 32, 32, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 32, 32, 32, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 512, 512, 512, 1, 1, 1, 1024, 1024, 1024, 32, 32, 32, 32, 32, 32, 45, 45, 45, 8, 8, 8, 512, 512, 512, 256, 256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 16, 16, 16, 4, 4, 4, 4, 4, 4, 4, 4, 4, 16, 16, 16, 16, 16, 16, 16, 16, 16, 64, 64, 64, 8, 8, 8, 8, 8, 8, 8, 8, 8, 64, 64, 64, 64, 64, 64, 256, 256, 256, 64, 64, 64, 64, 64, 64, 512, 512, 512, 512, 512, 512, 512, 512, 512, 32, 32, 32, 32, 32, 32, 32, 32, 32, 128, 128, 128, 128, 128, 128, 128, 128, 128, 32, 32, 32, 128, 128, 128, 64, 64, 64, 64, 64, 64, 16, 16, 16, 256, 256, 256, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 256, 
256, 256, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 256, 256, 256, 256, 256, 256, 1024, 1024, 1024, 1024, 1024, 1024, 256, 256, 256, 3072, 3072, 3072, 3072, 3072, 3072, 128, 128, 128, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 128, 128, 128, 128, 128, 128, 64, 64, 64, 256, 256, 256, 256, 256, 256, 512, 512, 512, 768, 768, 768, 768, 768, 768, 16, 16, 16, 32, 32, 32, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 512, 512, 512, 2048, 2048, 2048, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 3072, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 64, 64, 64, 96, 96, 96, 512, 512, 512, 64, 64, 64, 64, 64, 64, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 
1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 64, 64, 64, 64, 64, 64, 256, 256, 256, 1024, 1024, 1024, 512, 512, 512, 256, 256, 256, 512, 512, 512, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 512, 512, 512, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048, 3072, 3072, 3072, 3072, 3072, 3072, 2048, 2048, 2048, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 2048, 2048, 2048, 2048, 2048, 2048, 1024, 1024, 1024, 2048, 2048, 2048, 3072, 3072, 3072, 2048, 2048, 2048]

With this data, I tried these new values:

# new values
x = np.asarray([128,512,1024,2048,3072,2800])

Something goes wrong with the 3072 (all values are in MB).

This is the result:

      100%      cdfv      kdev
128     26  0.129688  0.499376
512     55  0.275874  0.499671
1024    91  0.454159  0.499936
2048    12  0.062298  0.499150
3072     0  0.001556  0.498364
2800     1  0.004954  0.498573

I can't understand why this happens: the value 3072 appears many times in the data. This is the histogram of my data. It is very strange, because there are values at 3072 and also at 4096.

[Image: histogram of the data]

asked Jun 14 '15 10:06 by Usi Usi


1 Answer

A few general comments without going into statsmodels details.

statsmodels also has cdf kernels, but I don't remember how well they work, and I don't think it has automatic bandwidth selection for it.

Related to the answer of glen_b that ali_m linked to in the comment:

The cdf estimate converges much faster to the true distribution than the estimate of the density as the sample grows. To balance the bias - variance tradeoff we should use a smaller bandwidth for cdf kernels, that is undersmooth relative to density estimation. The estimates should be more accurate than the corresponding density estimates.
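
To make the kernel distribution estimate concrete: with a Gaussian kernel, the estimated cdf at v is the average of normal cdfs centered on the observations. The following is a minimal sketch under that assumption; the helper name `kernel_cdf`, the bandwidth of 0.3, and the sample are illustrative choices, not values taken from statsmodels.

```python
import numpy as np
from scipy.stats import norm

def kernel_cdf(v, sample, bandwidth):
    """Gaussian kernel estimate of the cdf at each point in v."""
    sample = np.asarray(sample, dtype=float)
    v = np.atleast_1d(np.asarray(v, dtype=float))
    # One normal cdf per observation, averaged over the sample.
    return norm.cdf((v[:, None] - sample[None, :]) / bandwidth).mean(axis=1)

# Hypothetical sample with three distinct values.
sample = np.array([1.0, 2.0, 3.0] * 50)
cdf_vals = kernel_cdf([0.0, 2.0, 5.0], sample, bandwidth=0.3)
# Two-sided tail value: small means far from the bulk of the data.
tail_vals = np.minimum(cdf_vals, 1 - cdf_vals)
print(cdf_vals, tail_vals)
```

A smaller bandwidth makes the estimate follow the empirical cdf more closely, which is the undersmoothing mentioned above.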

Number of tail observations:

If your largest observation in the sample is 4 and you want to know the cdf at 5, then your data has no information about it. For tails where you only have very few observations the variance of a nonparametric estimator like kernel distribution estimators will be large in relative terms (is it 1e-5 or 1e-20?).

As an alternative to kernel density or kernel distribution estimation, we can estimate a Pareto distribution for the tail parts. For example, take the largest 10 or 20 percent of observations, fit a Pareto distribution, and use it to extrapolate the tail density. There are several Python packages for power-law estimation that might be used for this.
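
One way the Pareto idea could be sketched without an extra package is the closed-form maximum-likelihood estimate of the Pareto shape above a threshold (often called the Hill estimator). Everything below is an assumption for illustration: the function name `pareto_tail_prob`, the 10 percent tail fraction, and the synthetic sample. It also assumes positive, continuous data with no ties at the threshold.

```python
import numpy as np

def pareto_tail_prob(sample, v, tail_frac=0.1):
    """Estimate P(X > v) by fitting a Pareto tail to the largest observations."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    k = max(int(n * tail_frac), 2)   # number of tail observations
    threshold = sample[-k - 1]       # largest observation below the tail
    if v <= threshold:
        # Inside the bulk of the data the empirical fraction is enough.
        return float(np.mean(sample > v))
    tail = sample[-k:]
    # Maximum-likelihood Pareto shape, with the threshold as the scale.
    alpha = k / np.log(tail / threshold).sum()
    # P(X > v) = P(X > threshold) * P(X > v | X > threshold)
    return float((k / n) * (v / threshold) ** (-alpha))

# Synthetic classical Pareto sample with shape 3 and minimum 1.
rng = np.random.default_rng(0)
sample = rng.pareto(3.0, size=5000) + 1.0
p_near = pareto_tail_prob(sample, 10.0)
p_far = pareto_tail_prob(sample, 100.0)
print(p_near, p_far)
```

Unlike a kernel estimate, this gives a nonzero and slowly decaying probability beyond the largest observation, at the price of assuming a power-law tail.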

update

The following shows how to calculate "outlyingness" using a parametric normal distribution assumption and a gaussian kernel density estimate with fixed bandwidth.

This is only really correct if the sample comes from a continuous distribution or can be approximated by one. Here we pretend that a sample that has only 3 distinct values comes from a normal distribution. Essentially, the calculated cdf value is like a distance measure, not a probability for a discrete random variable.

This uses kde from scipy.stats with fixed bandwidth instead of the statsmodels version.

I'm not sure how the bandwidth is set in scipy's gaussian_kde, so my fixed bandwidth choice equal to scale is likely wrong. I don't know how I would choose a bandwidth if there are only three distinct values; there is not enough information in the data. The default bandwidth is intended for distributions that are approximately normal, or at least single-peaked.

import numpy as np
from scipy import stats

# data
ram = np.array([2, <truncated from data in description>, 3])

loc = ram.mean()
scale = ram.std()

# new values
x = np.asarray([-1, 0, 2, 3, 4, 5, 100])

# assume normal distribution
cdf_val = stats.norm.cdf(x, loc=loc, scale=scale)
cdfv = np.minimum(cdf_val, 1 - cdf_val)

# use gaussian kde but fix bandwidth
kde = stats.gaussian_kde(ram, bw_method=scale)
kde_val = np.asarray([kde.integrate_box_1d(-np.inf, xx) for xx in x])
kdev = np.minimum(kde_val, 1 - kde_val)


#print(np.column_stack((x, cdfv, kdev)))
# use pandas for prettier table
import pandas as pd
print(pd.DataFrame({'cdfv': cdfv, 'kdev': kdev}, index=x))

'''
          cdfv      kdev
-1    0.000096  0.000417
 0    0.006171  0.021262
 2    0.479955  0.482227
 3    0.119854  0.199565
 5    0.000143  0.000472
 100  0.000000  0.000000
 '''
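
A note on the bandwidth question raised above: in scipy's `gaussian_kde`, a scalar `bw_method` is used directly as `kde.factor`, and the kernel covariance is the data covariance times `factor**2`. Passing `bw_method=scale` therefore gives a kernel standard deviation of roughly `scale * std`, not `scale`. The sketch below checks this on made-up MB-scale data (the array `ram_mb` and the helper `tail_value` are hypothetical, not the asker's data or code) and shows how such a wide kernel pushes the kde-based tail value toward 0.5 for every query, which would be consistent with the `kdev` column in the question.

```python
import numpy as np
from scipy import stats

# Hypothetical RAM sizes in MB (not the data from the question).
ram_mb = np.repeat([128.0, 256.0, 512.0, 1024.0, 2048.0, 3072.0],
                   [5, 10, 30, 35, 15, 5])
scale = ram_mb.std()

def tail_value(kde, v):
    """min(cdf, 1 - cdf) at v, computed like the kdev column."""
    cdf = kde.integrate_box_1d(-np.inf, v)
    return min(cdf, 1 - cdf)

kde_wide = stats.gaussian_kde(ram_mb, bw_method=scale)  # factor = scale
kde_unit = stats.gaussian_kde(ram_mb, bw_method=1.0)    # factor = 1
kde_default = stats.gaussian_kde(ram_mb)                # Scott's rule

# Factor, resulting kernel standard deviation, and tail value at 3072 MB.
for name, kde in [("wide", kde_wide), ("unit", kde_unit), ("default", kde_default)]:
    kernel_std = np.sqrt(kde.covariance[0, 0])
    print(name, kde.factor, kernel_std, tail_value(kde, 3072.0))
```

With data on this scale, a factor near 1 or the default rule keeps the kernel width comparable to the spread of the data, while the normal-distribution column `cdfv` is unaffected by the choice.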
answered Sep 24 '22 19:09 by Josef