Choosing the number of clusters in hierarchical agglomerative clustering with scikit-learn

Tags: artificial-intelligence, machine-learning, cluster-analysis, unsupervised-learning, scikit-learn

The Wikipedia article on determining the number of clusters in a dataset indicates that I do not need to worry about this problem when using hierarchical clustering. However, when I tried scikit-learn's agglomerative clustering, I saw that I have to pass the number of clusters as the parameter "n_clusters", without which I get the hardcoded default of two clusters. How can I go about choosing the right number of clusters for the dataset in this case? Is the wiki article wrong?

asked Aug 26 '15 09:08

DaTaBomB

1 Answer

Wikipedia is simply making an extreme simplification that has nothing to do with real life. Hierarchical clustering does not avoid the problem of choosing the number of clusters. It simply constructs a tree spanning all samples, which shows which samples (and, later on, which clusters) merge together to create a bigger cluster. This happens recursively until you have just two clusters (this is why the default number of clusters is 2), which are then merged into the whole dataset. You are left on your own to "cut" through the tree to get an actual clustering. Once you fit AgglomerativeClustering, you can traverse the whole tree and analyze which clusters to keep:

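import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups: 3 points near 0 and 2 points near 100
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
clustering = AgglomerativeClustering()
clustering.fit(X)

# Row i of children_ describes the merge that creates node n_samples + i;
# ids below n_samples refer to the original samples (the leaves of the tree)
n_samples = X.shape[0]
[{'node_id': n_samples + i, 'left': int(left), 'right': int(right)}
 for i, (left, right) in enumerate(clustering.children_)]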
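One possible way to do the "cutting" without walking the tree by hand is sketched below. It is not part of the original answer and rests on two assumptions: a scikit-learn version that supports the distance_threshold parameter and the distances_ attribute, and the default Ward linkage, for which merge distances do not decrease. The "largest gap" rule used to pick the cut height is only a heuristic chosen for illustration, and the fixed seed is there only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Same toy data as above, with a fixed seed (an assumption, for repeatability)
rng = np.random.default_rng(0)
X = np.concatenate([rng.standard_normal((3, 10)),
                    rng.standard_normal((2, 10)) + 100])

# n_clusters=None with distance_threshold=0 builds the full tree and
# stores the linkage distance of every merge in distances_
full_tree = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)
merge_distances = full_tree.distances_  # one value per row of children_

# Heuristic (illustrative only): cut inside the largest jump between
# consecutive merge distances
gaps = np.diff(merge_distances)
i = int(np.argmax(gaps))
threshold = (merge_distances[i] + merge_distances[i + 1]) / 2

# Refit with that threshold; merges at or above it are not performed
cut = AgglomerativeClustering(n_clusters=None, distance_threshold=threshold).fit(X)
n_found = cut.n_clusters_
labels = cut.labels_
print(n_found, labels)
```

On data like this, where one merge is far more expensive than the others, the cut should separate the two groups. On real data the gaps are rarely this clear, so it is worth inspecting merge_distances (or a dendrogram) before trusting any automatic rule.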

answered Sep 20 '22 05:09

lejlot
