My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data, with the model converging to some number of clusters during inference.
This R implementation,
https://github.com/jacobian1980/ecostates, determines the number of clusters in this way. The R implementation
uses a Gibbs sampler, though, and I'm not sure whether that affects this.
What confuses me is the n_components
parameter. The docs say: n_components: int, default 1 :
Number of mixture components.
If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?
Ultimately, I'm trying to get:
(1) the cluster assignment for each sample;
(2) the probability vectors for each cluster; and
(3) the likelihood/log-likelihood for each sample.
It looks like (1) is the predict
method, and (3) is the score
method. However, the output of (1) depends entirely on the n_components
hyperparameter.
My apologies if this is a naive question; I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation
in scikit-learn
that I wanted to try out.
Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM
Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here's my naive usage:
import pandas as pd
from sklearn.mixture import DPGMM

X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
mod_dpgmm = DPGMM(n_components=3)
mod_dpgmm.fit(X)
The Dirichlet Process Mixture (DPM) is a model used for clustering, with the advantage of discovering the number of clusters automatically and offering nice properties such as its potential convergence to the true clusters in the data.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
To sample a point from a GMM with K components, first choose a mixture component by drawing j from the categorical distribution with probabilities [π1,…,πK], then draw the point from the j-th Gaussian.
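A minimal sketch of that two-step sampling procedure. The weights, means, and standard deviations below are made-up illustrative values, not anything from the question's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D GMM with K = 3 components (parameters chosen for illustration)
weights = np.array([0.5, 0.3, 0.2])   # mixture probabilities [pi_1, ..., pi_K]
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_gmm(n):
    # Step 1: draw a component index j from the categorical distribution
    j = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each point from its chosen Gaussian component
    return rng.normal(means[j], stds[j])

samples = sample_gmm(1000)
```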
Now the class DPGMM is deprecated,
as the warning shows:
DeprecationWarning: Class DPGMM is deprecated; The DPGMM
class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture
class with parameter weight_concentration_prior_type='dirichlet_process'
instead. DPGMM is deprecated in 0.18 and will be removed in 0.20.
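A minimal sketch of the recommended replacement, on synthetic two-blob data (the data, the n_components=10 truncation, and the alpha value of 0.1 are all illustrative assumptions). It also shows one way to get the three things the question asks for:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(1)
# Synthetic data: two well-separated blobs, but we allow up to 10 components
X = np.vstack([rng.randn(200, 2) + [0, 0],
               rng.randn(200, 2) + [8, 8]])

model = BayesianGaussianMixture(
    n_components=10,                  # truncation level: an upper bound, not the final count
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,   # small alpha -> fewer effective clusters
    max_iter=500,
    random_state=1,
).fit(X)

labels = model.predict(X)          # (1) cluster assignment for each sample
means  = model.means_              # (2) estimated parameters for each cluster
loglik = model.score_samples(X)    # (3) log-likelihood for each sample

# Components the DP prior left unused get negligible weight
print(np.round(model.weights_, 3))
```

For (2), a GMM's per-cluster parameters are means_ and covariances_ rather than probability vectors; predict_proba gives the per-sample posterior probabilities over clusters if that is what's needed.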
As mentioned by @maxymoo in the comments, n_components
is a truncation parameter: it sets an upper bound on the number of components, not the number the model actually uses.
In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation in sklearn's DP-GMM, a new data point joins an existing cluster k
with probability |k| / (n - 1 + alpha)
and starts a new cluster with probability alpha / (n - 1 + alpha).
Here alpha is the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
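A quick simulation of that process (the sample size and alpha values below are arbitrary choices for illustration) shows how a larger alpha tends to produce more clusters:

```python
import numpy as np

def crp(n, alpha, rng):
    """Simulate cluster sizes under the Chinese Restaurant Process.

    Customer i joins existing cluster k with probability |k| / (i - 1 + alpha)
    and starts a new cluster with probability alpha / (i - 1 + alpha).
    """
    sizes = [1]                      # the first customer starts cluster 0
    for i in range(2, n + 1):
        probs = np.array(sizes + [alpha]) / (i - 1 + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)          # open a new cluster
        else:
            sizes[k] += 1            # join existing cluster k
    return sizes

rng = np.random.default_rng(0)
for alpha in (0.5, 5.0):
    counts = [len(crp(500, alpha, rng)) for _ in range(20)]
    print(alpha, np.mean(counts))    # larger alpha -> more clusters on average
```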
Unlike the R implementation, which uses Gibbs sampling, sklearn's DP-GMM uses variational inference. This may account for differences in the results.
A gentle Dirichlet Process tutorial can be found here.