I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.
The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the gaussians that are positioned at every sample point and summed up. Lower bandwidths are closer to the data, higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3
could be a good option. In order to compare different kde's it is recommended to each time choose exactly the same bandwidth.
Here is some sample code to show the difference between bw='scott'
and bw=0.3
. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})
for i, bw in enumerate(['scott', 0.3]):
for j, num_same in enumerate([400, 450, 500]):
y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
sns.kdeplot(y, bw=bw, ax=axs[i, j])
axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the kde can not be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case scipy.statsmodels.nonparametric.kde
is used (not scipy.stats.gaussian_kde
). There the default is "scott" - 1.059 * A * nobs ** (-1/5.), where A is min(std(X),IQR/1.34)
. The min()
clarifies the abrupt change in behavior. IQR
is the "interquartile range", the difference between the 75th and 25th percentiles.
Edit: Since Seaborn 0.11, the statsmodel
backend has been dropped, so kde's are only calculated via scipy.stats.gaussian_kde
.
If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. The kernel smoothing might be applied for such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated to the Dirac is zero. However, in most KDE methods, there is only one single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimation of the rugosity of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.
We can, however, try to separate the sample into two sub-samples:
(This idea has already been mentionned by johanc).
Below is an attempt to perform this classification. The np.unique
method is used to count the occurences of the replicated realizations. The replicated values are associated with Diracs and the weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, uniques, are then used to estimate the continuous distribution with KDE.
The following function will be useful in order to overcome a limitation with the current implementation of the draw
method of Mixtures with OpenTURNS.
def DrawMixtureWithDiracs(distribution):
"""Draw a distributions which has Diracs.
https://github.com/openturns/openturns/issues/1489"""
graph = distribution.drawPDF()
graph.setLegends(["Mixture"])
for atom in distribution.getDistributionCollection():
if atom.getName() == "Dirac":
curve = atom.drawPDF()
curve.setLegends(["Dirac"])
graph.add(curve)
return graph
The following script creates a use-case with a Mixture containing a Dirac and a gaussian distributions.
import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0),
ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)
This is the result.
Then we create a sample.
sample = distribution.getSample(100)
This is where your problem begins. We count the number of occurences of each realizations.
array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True,
return_counts=True)
For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.
sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
if count[i] == 1:
uniqueValues.append(unique[i][0])
else:
atom = ot.Dirac(unique[i])
listOfDiracs.append(atom)
w = count[i] / sampleSize
print("New Dirac =", unique[i], " with weight =", w)
listOfWeights.append(w)
The weight of the continuous atom is the complementary of the sum of the weights of the Diracs. This way, the sum of the weights will be equal to 1.
complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)
The easy part comes: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.
sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)
Et voilà: the Mixture is ready.
mixture_estimated = ot.Mixture(atoms, weights)
The following script compares the initial Mixture and the estimated one.
graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph
The figure seems satisfactory, since the continuous distribution is estimated from a sub-sample which size is only equal to 50, i.e. one half of the full sample.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With