Suppose that I have the following observations of integers:
df = pd.DataFrame({'observed_scores': [100, 100, 90, 85, 100, ...]})
I know that this can be used as an input to make a density plot:
df['observed_scores'].plot.density()
but suppose that what I have is a counts table:
df = pd.DataFrame({'observed_scores': [100, 95, 90, 85, ...], 'counts': [1534, 1399, 3421, 8764, ...]})
which is cheaper to store than the full observed_scores Series (I have LOTS of observations).
I know it's possible to plot the histogram using the counts, but how do I plot the density plot? If possible, can it be done without having to unstack/unravel the counts table into thousands of rows?
So, you need to convert from counts to probabilities. It is actually easy: sum all the counts, then divide each individual count by that total.
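A minimal sketch of that normalization, assuming the four-row counts table used in the answer further down (the question's real table is longer and elided). The column name `probability` is made up for illustration:

```python
import pandas as pd

# Four visible rows of the counts table; the full table is elided in the question.
df = pd.DataFrame({'observed_scores': [100, 95, 90, 85],
                   'counts': [1534, 1399, 3421, 8764]})

# Relative frequency of each score: its count divided by the total count.
# 'probability' is a hypothetical column name chosen for this sketch.
df['probability'] = df['counts'] / df['counts'].sum()

print(df)
```

These relative frequencies sum to 1 and can serve directly as weights for a weighted density estimate.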
The function $f_X(x)$ gives us the probability density at point $x$. It is the limit of the probability of the interval $(x, x+\Delta]$ divided by the length of the interval, as the length of the interval goes to 0. Remember that $P(x < X \le x+\Delta) = F_X(x+\Delta) - F_X(x)$, so $f_X(x) = \lim_{\Delta \to 0^+} \frac{F_X(x+\Delta) - F_X(x)}{\Delta} = \frac{dF_X(x)}{dx} = F'_X(x)$, if $F_X(x)$ is differentiable at $x$.
To translate the probability density $\rho(x)$ into a probability, imagine that $I_x$ is some small interval around the point $x$. Then, assuming $\rho$ is continuous, the probability that $X$ is in that interval will depend both on the density $\rho(x)$ and the length of the interval: $\Pr(X \in I_x) \approx \rho(x) \times \text{Length of } I_x$.
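A small numeric check can make the two statements above concrete. This sketch assumes a standard normal distribution purely for illustration (nothing in the question says the scores are normal); the helper names `normal_pdf` and `normal_cdf` are made up here. It compares the exact interval probability with density times interval length as the interval shrinks:

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution at x."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    """Cumulative distribution function of the standard normal at x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x = 1.0
for delta in (0.5, 0.1, 0.01):
    # Exact probability of the interval (x, x + delta], from the CDF.
    exact = normal_cdf(x + delta) - normal_cdf(x)
    # Approximation: density at x times the length of the interval.
    approx = normal_pdf(x) * delta
    print(f"delta={delta}: exact={exact:.6f}, approx={approx:.6f}")
```

The two columns should agree more closely as `delta` gets smaller, which is the limit statement in action.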
With Matplotlib and Python: compute the histogram of the data with bins=10, normalize the bin counts to get the probability distribution function (pdf), take the cumulative sum of the pdf to get the cdf, and plot the cdf using the plot() method with label "CDF".
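Those steps can be applied to a counts table without expanding it, because `numpy.histogram` accepts a `weights` argument. A minimal sketch, assuming the four-row table from the answer below stands in for the real (elided) data:

```python
import numpy as np

# Four visible rows of the counts table; the full table is elided in the question.
scores = np.array([100, 95, 90, 85], dtype=float)
counts = np.array([1534, 1399, 3421, 8764], dtype=float)

# Weighted histogram: each score contributes its count instead of 1.
hist, bin_edges = np.histogram(scores, bins=10, weights=counts)

# Normalize the bin totals to probabilities, then accumulate them.
pdf = hist / hist.sum()
cdf = np.cumsum(pdf)

print(pdf)
print(cdf)
```

With Matplotlib imported as `plt`, `plt.plot(bin_edges[1:], cdf, label="CDF")` would then draw the curve.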
IIUC, statsmodels lets you fit a weighted KDE:
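import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.nonparametric.kde import KDEUnivariate

df = pd.DataFrame({'observed_scores': [100, 95, 90, 85],
                   'counts': [1534, 1399, 3421, 8764]})
# Cast to float so the estimator receives floating-point data.
scores = df.observed_scores.astype(float)
kde1 = KDEUnivariate(scores)
kde_noweight = KDEUnivariate(scores)
# Weights are only supported by the non-FFT code path, hence fft=False.
kde1.fit(weights=df.counts.to_numpy(dtype=float), fft=False)
kde_noweight.fit()
plt.plot(kde1.support, kde1.density)
plt.plot(kde_noweight.support, kde_noweight.density)
plt.legend(['weighted', 'unweighted'])
plt.show()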
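As a possible alternative (not part of the answer above), SciPy's `gaussian_kde` also takes a `weights` argument in SciPy 1.2 and later, so the counts table can be used as-is. A minimal sketch under that assumption; the grid padding of 10 and the 200 grid points are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Four visible rows of the counts table; the full table is elided in the question.
scores = np.array([100, 95, 90, 85], dtype=float)
counts = np.array([1534, 1399, 3421, 8764], dtype=float)

# Each score is weighted by its count; SciPy normalizes the weights internally.
kde = gaussian_kde(scores, weights=counts)

# Evaluate the estimated density on a grid that extends past the observed range.
grid = np.linspace(scores.min() - 10, scores.max() + 10, 200)
density = kde(grid)
```

With Matplotlib imported as `plt`, `plt.plot(grid, density)` would draw the weighted density. The default bandwidth here is SciPy's and will generally differ from the statsmodels default, so the two curves need not match exactly.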