Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to estimate density function and calculate its peaks?

I have started to use python for analysis. I would like to do the following:

  1. Get the distribution of dataset
  2. Get the peaks in this distribution

I used gaussian_kde from scipy.stats to make estimation for kernel density function. Does guassian_kde make any assumption about the data ?. I am using data that are changed over time. so if data has one distribution (e.g. Gaussian), it could have another distribution later. Does gaussian_kde have any drawbacks in this scenario?. It was suggested in question to try to fit the data in every distribution in order to get the data distribution. So what's the difference between using gaussian_kde and the answer provided in question. I used the code below, I was wondering also to know is gaussian_kde good way to estimate pdf if the data will be changed over time ?. I know one advantage of gaussian_kde is that it calculate bandwidth automatically by a rule of thumb as in here. Also, how can I get its peaks?

import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
df = pd.read_csv('D:\dataset.csv')
pdf = scipy.stats.kde.gaussian_kde(df)
x = np.linspace((df.min()-1),(df.max()+1), len(df)) 
y = pdf(x)                          

pl.plot(x, y, color = 'r') 
pl.hist(data_column, normed= True)
pl.show(block=True)       
like image 799
Yasmin Avatar asked Jul 06 '15 14:07

Yasmin


People also ask

How do you find the density function?

On differentiating the cumulative distribution function, we obtain the probability density function. The mean of the probability density function can be give as E[X] = μ=∫∞−∞xf(x)dx μ = ∫ − ∞ ∞ x f ( x ) d x .

How do you find the peak of a distribution?

Set a variable max = 0. Then calculate value of Y at each X. If Y(X1) > max then set max=Y(X). Once you go through all the Ys, what you'll have in max will be the peak value of Y.

What is the peak of a density plot?

One such plot is the density plot. The Density Plot shows the smoothed distribution of the points along the numeric axis. The peaks of the density plot are at the locations where there is the highest concentration of points.

What is density estimation it estimates probability density function?

In probability and statistics, density estimation is the construction of an estimate, based on observed data, of an unobservable underlying probability density function.


1 Answers

I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf
nparam_density = stats.kde.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)

# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, normed=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')

enter image description here

From the graph, we see that the non-parametric density is nothing but a smoothed version of histogram. In histogram, for a particular observation x=x0, we use a bar to represent it (put all probability mass on that single point x=x0 and zero elsewhere) whereas in non-parametric density estimation, we use a bell-shaped curve (the gaussian kernel) to represent that point (spreads over its neighbourhood). And the result is a smoothed density curve. This internal gaussian kernel has nothing to do with your distributional assumption on the underlying data x. Its sole purpose is for smoothing.

To get the mode of non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to have uni-mode. As shown in the example above, if you quasi-Newton optimization algo starts between [5,10], it is very likely to end up with a local optimal point rather than the global one.

# get mode: exhastive search
x[np.argsort(nparam_density)[-1]]
like image 155
Jianxun Li Avatar answered Sep 28 '22 17:09

Jianxun Li