How to estimate density function and calculate its peaks?

Tags:

I have started to use python for analysis. I would like to do the following:

Get the distribution of dataset
Get the peaks in this distribution

I used gaussian_kde from scipy.stats to make estimation for kernel density function. Does guassian_kde make any assumption about the data ?. I am using data that are changed over time. so if data has one distribution (e.g. Gaussian), it could have another distribution later. Does gaussian_kde have any drawbacks in this scenario?. It was suggested in question to try to fit the data in every distribution in order to get the data distribution. So what's the difference between using gaussian_kde and the answer provided in question. I used the code below, I was wondering also to know is gaussian_kde good way to estimate pdf if the data will be changed over time ?. I know one advantage of gaussian_kde is that it calculate bandwidth automatically by a rule of thumb as in here. Also, how can I get its peaks?

import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
df = pd.read_csv('D:\dataset.csv')
pdf = scipy.stats.kde.gaussian_kde(df)
x = np.linspace((df.min()-1),(df.max()+1), len(df)) 
y = pdf(x)                          

pl.plot(x, y, color = 'r') 
pl.hist(data_column, normed= True)
pl.show(block=True)

799

asked Jul 06 '15 14:07

Yasmin

1 Answers

I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])

df = pd.DataFrame(gaussian_mixture, columns=['data'])

# non-parametric pdf
nparam_density = stats.kde.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)

# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, normed=True)
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')

enter image description here

From the graph, we see that the non-parametric density is nothing but a smoothed version of histogram. In histogram, for a particular observation x=x0, we use a bar to represent it (put all probability mass on that single point x=x0 and zero elsewhere) whereas in non-parametric density estimation, we use a bell-shaped curve (the gaussian kernel) to represent that point (spreads over its neighbourhood). And the result is a smoothed density curve. This internal gaussian kernel has nothing to do with your distributional assumption on the underlying data x. Its sole purpose is for smoothing.

To get the mode of non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to have uni-mode. As shown in the example above, if you quasi-Newton optimization algo starts between [5,10], it is very likely to end up with a local optimal point rather than the global one.

# get mode: exhastive search
x[np.argsort(nparam_density)[-1]]

155

answered Sep 28 '22 17:09

Jianxun Li

Related questions
                            
                                Catch ExpatError in xmltodict
                            
                                Can't find "six", but it's installed
                            
                                Capturing named groups in regex with re.findall
                            
                                with and closing of files in Python
                            
                                Python: lists and copy of them
                            
                                How can I read the contents of all the files in a directory with pandas?
                            
                                Python: Random number generator with mean and Standard Deviation
                            
                                SQLAlchemy: Convert column value back and forth between internal and database format
                            
                                How can I generate a colormap array from a simple array in matplotlib
                            
                                Python Numpy mask NaN not working
                            
                                Python for loop decrementing index
                            
                                Count occurrences of item in JSON element
                            
                                Convert Content-Type header into file extension
                            
                                Python fails to open 11gb csv in r+ mode but opens in r mode
                            
                                PEP 8 and list comprehension
                            
                                How can I export an instance and all its related objects in Django?
                            
                                Pass Python list to embedded Rust function
                            
                                Alternative to `any` that returns the last evaluated object?
                            
                                Selenium pdf automatic download not working
                            
                                Flask SqlAlchemy join two models without foreign key MYSQL

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to estimate density function and calculate its peaks?

Tags:

python

pandas

numpy

python-2.7

Yasmin

People also ask

1 Answers

Jianxun Li

Recent Activity

Donate For Us