My understanding of "an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters" is that the number of clusters is determined by the data, with the model converging to some number of clusters during inference.
This R implementation,
https://github.com/jacobian1980/ecostates, determines the number of clusters in this way. The R implementation
uses a Gibbs sampler, though, and I'm not sure whether that affects this.
What confuses me is the n_components
parameter. The docs say: n_components: int, default 1 :
Number of mixture components.
If the number of components is determined by the data and the Dirichlet Process, then what is this parameter?
Ultimately, I'm trying to get:
(1) the cluster assignment for each sample;
(2) the probability vectors for each cluster; and
(3) the likelihood/log-likelihood for each sample.
It looks like (1) is the predict
method, and (3) is the score
method. However, the output of (1) depends entirely on the n_components
hyperparameter.
My apologies if this is a naive question; I'm very new to Bayesian programming and noticed there was a Dirichlet Process implementation
in scikit-learn
that I wanted to try out.
Here's the docs: http://scikit-learn.org/stable/modules/generated/sklearn.mixture.DPGMM.html#sklearn.mixture.DPGMM
Here's an example of usage: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html
Here's my naive usage:
import pandas as pd
from sklearn.mixture import DPGMM

X = pd.read_table("Data/processed/data.tsv", sep="\t", index_col=0)
mod_dpgmm = DPGMM(n_components=3)
mod_dpgmm.fit(X)
The Dirichlet Process Mixture (DPM) is a model used for clustering, with the advantage of discovering the number of clusters automatically and offering nice properties such as its potential convergence to the true clusters in the data.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
To sample a point from a GMM with K components, first choose a mixture component by drawing j from the categorical distribution with probabilities [π1,…,πK], then draw the point from the j-th Gaussian.
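A minimal sketch of that two-step sampling procedure. The weights, means, and standard deviations below are made-up illustrative values, not anything from the question's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D GMM with K = 3 components (parameters chosen for illustration)
weights = np.array([0.5, 0.3, 0.2])   # mixture probabilities [pi_1, ..., pi_K]
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_gmm(n):
    # Step 1: draw a component index j from the categorical distribution
    j = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each point from its chosen Gaussian component
    return rng.normal(means[j], stds[j])

samples = sample_gmm(1000)
```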
Now the class DPGMM is deprecated,
as the warning shows:
DeprecationWarning: Class DPGMM is deprecated; The DPGMM
class is not working correctly and it's better to use sklearn.mixture.BayesianGaussianMixture
class with parameter weight_concentration_prior_type='dirichlet_process'
instead. DPGMM is deprecated in 0.18 and will be removed in 0.20.
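A minimal sketch of the recommended replacement, on synthetic two-blob data (the data, the n_components=10 truncation, and the alpha value of 0.1 are all illustrative assumptions). It also shows one way to get the three things the question asks for:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(1)
# Synthetic data: two well-separated blobs, but we allow up to 10 components
X = np.vstack([rng.randn(200, 2) + [0, 0],
               rng.randn(200, 2) + [8, 8]])

model = BayesianGaussianMixture(
    n_components=10,                  # truncation level: an upper bound, not the final count
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,   # small alpha -> fewer effective clusters
    max_iter=500,
    random_state=1,
).fit(X)

labels = model.predict(X)          # (1) cluster assignment for each sample
means  = model.means_              # (2) estimated parameters for each cluster
loglik = model.score_samples(X)    # (3) log-likelihood for each sample

# Components the DP prior left unused get negligible weight
print(np.round(model.weights_, 3))
```

For (2), a GMM's per-cluster parameters are means_ and covariances_ rather than probability vectors; predict_proba gives the per-sample posterior probabilities over clusters if that is what's needed.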
As mentioned by @maxymoo in the comments, n_components
is a truncation parameter: it sets an upper bound on the number of components, not the number the model actually uses.
In the context of the Chinese Restaurant Process, which is related to the stick-breaking representation in sklearn's DP-GMM, a new data point joins an existing cluster k
with probability |k| / (n - 1 + alpha)
and starts a new cluster with probability alpha / (n - 1 + alpha).
Here alpha is the concentration parameter of the Dirichlet Process, and it influences the final number of clusters.
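A quick simulation of that process (the sample size and alpha values below are arbitrary choices for illustration) shows how a larger alpha tends to produce more clusters:

```python
import numpy as np

def crp(n, alpha, rng):
    """Simulate cluster sizes under the Chinese Restaurant Process.

    Customer i joins existing cluster k with probability |k| / (i - 1 + alpha)
    and starts a new cluster with probability alpha / (i - 1 + alpha).
    """
    sizes = [1]                      # the first customer starts cluster 0
    for i in range(2, n + 1):
        probs = np.array(sizes + [alpha]) / (i - 1 + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(sizes):
            sizes.append(1)          # open a new cluster
        else:
            sizes[k] += 1            # join existing cluster k
    return sizes

rng = np.random.default_rng(0)
for alpha in (0.5, 5.0):
    counts = [len(crp(500, alpha, rng)) for _ in range(20)]
    print(alpha, np.mean(counts))    # larger alpha -> more clusters on average
```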
Unlike the R implementation, which uses Gibbs sampling, sklearn's DP-GMM uses variational inference. This may account for differences in the results.
A gentle Dirichlet Process tutorial can be found here.