 

How to convert a DataFrame of counts to a probability density function

Suppose that I have the following observations of integers:

df = pd.DataFrame({'observed_scores': [100, 100, 90, 85, 100, ...]})

I know that this can be used as an input to make a density plot:

df['observed_scores'].plot.density()

but suppose that what I have is a counts table:

df = pd.DataFrame({'observed_scores': [100, 95, 90, 85, ...], 'counts': [1534, 1399, 3421, 8764, ...]})

which is cheaper to store than the full observed_scores Series (I have LOTS of observations).

I know it's possible to plot the histogram using the counts, but how do I plot the density plot? If possible, can it be done without having to unstack/unravel the counts table into thousands of rows?

asked Jun 22 '20 15:06 by irene



1 Answer

IIUC, statsmodels lets you fit a weighted KDE:

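import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.nonparametric.kde import KDEUnivariate

df = pd.DataFrame({'observed_scores': [100, 95, 90, 85],
                   'counts': [1534, 1399, 3421, 8764]})

# Work on float arrays; integer input can trip up the FFT-based fit
scores = df['observed_scores'].to_numpy(dtype=float)
counts = df['counts'].to_numpy(dtype=float)

kde_weighted = KDEUnivariate(scores)
kde_noweight = KDEUnivariate(scores)
# Weights are not implemented for the FFT estimator, hence fft=False
kde_weighted.fit(weights=counts, fft=False)
kde_noweight.fit()  # ignores the counts: every score is counted once

plt.plot(kde_weighted.support, kde_weighted.density)
plt.plot(kde_noweight.support, kde_noweight.density)
plt.legend(['weighted', 'unweighted'])
plt.show()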

Output:

[Image: plot of the weighted and unweighted density curves]
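
To see what the weighting does, here is a minimal NumPy sketch of a weighted Gaussian KDE. It is not statsmodels' implementation: the helper name `weighted_gaussian_kde`, the fixed `bandwidth=2.5`, and the evaluation grid are all illustrative assumptions. Each distinct score contributes one Gaussian bump scaled by its share of the total count, which gives the same curve as expanding the table into one row per observation, as long as the same bandwidth is used in both cases.

```python
import numpy as np
import pandas as pd


def weighted_gaussian_kde(values, counts, grid, bandwidth):
    """Gaussian KDE of `values`, each repeated `counts` times, evaluated on `grid`.

    Hypothetical helper for illustration; `bandwidth` is the kernel standard
    deviation and must be chosen by the caller.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(counts, dtype=float)
    weights = weights / weights.sum()  # counts -> relative frequencies
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid[i] to values[j]
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Weighted sum of one kernel per distinct value
    return kernels @ weights


df = pd.DataFrame({'observed_scores': [100, 95, 90, 85],
                   'counts': [1534, 1399, 3421, 8764]})

# Grid and bandwidth are assumed values, picked only to cover the sample data
grid = np.linspace(60.0, 125.0, 651)
density = weighted_gaussian_kde(df['observed_scores'], df['counts'],
                                grid, bandwidth=2.5)
```

The work scales with the number of distinct scores rather than the number of observations, so the counts table never has to be unravelled. `plt.plot(grid, density)` draws the curve. `scipy.stats.gaussian_kde` also accepts a `weights` argument, but its default bandwidth rule is based on an effective sample size computed from the weights, so its curve may differ from one fitted on the expanded data.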

answered Oct 04 '22 09:10 by Juan C