Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distributed hierarchical clustering

Tags:

Are there any algorithms that can help with hierarchical clustering? Google's map-reduce has only an example of k-clustering. In case of hierarchical clustering, I'm not sure how it's possible to divide the work between nodes. Other resource that I found is: http://issues.apache.org/jira/browse/MAHOUT-19 But it's not apparent, which algorithms are used.

like image 938
Roman Avatar asked Sep 17 '08 16:09

Roman


People also ask

What are the two types of hierarchical clustering?

There are two types of hierarchical clustering: divisive (top-down) and agglomerative (bottom-up).

What is the distributed model of clustering?

Distribution clustering: Distribution-based clustering directly relates to the use of distribution models (e.g. Gaussian/Normal) in statistics. Fundamentally, clusters are defined based on how likely the objects included are likely to belong to the same distribution.

What is the difference between agglomerative and divisive hierarchical clustering?

Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

What is hierarchical clustering and its types?

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. For example, all files and folders on the hard disk are organized in a hierarchy. There are two types of hierarchical clustering, Divisive and Agglomerative.


2 Answers

First, you have to decide if you're going to build your hierarchy bottom-up or top-down.

Bottom-up is called Hierarchical agglomerative clustering. Here's a simple, well-documented algorithm: http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html.

Distributing a bottom-up algorithm is tricky because each distributed process needs the entire dataset to make choices about appropriate clusters. It also needs a list of clusters at its current level so it doesn't add a data point to more than one cluster at the same level.

Top-down hierarchy construction is called Divisive clustering. K-means is one option to decide how to split your hierarchy's nodes. This paper looks at K-means and Principal Direction Divisive Partitioning (PDDP) for node splitting: http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/tm07book.pdf. In the end, you just need to split each parent node into relatively well-balanced child nodes.

A top-down approach is easier to distribute. After your first node split, each node created can be shipped to a distributed process to be split again and so on... Each distributed process needs only to be aware of the subset of the dataset it is splitting. Only the parent process is aware of the full dataset.

In addition, each split could be performed in parallel. Two examples for k-means:

  • http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.1882&rep=rep1&type=pdf
  • http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html.
like image 144
Corbin March Avatar answered Sep 20 '22 06:09

Corbin March


Clark Olson reviews several distributed algorithms for hierarchical clustering:

C. F. Olson. "Parallel Algorithms for Hierarchical Clustering." Parallel Computing, 21:1313-1325, 1995, doi:10.1016/0167-8191(95)00017-I.

Parunak et al. describe an algorithm inspired by how ants sort their nests:

H. Van Dyke Parunak, Richard Rohwer, Theodore C. Belding,and Sven Brueckner: "Dynamic Decentralized Any-Time Hierarchical Clustering." In Proc. 4th International Workshop on Engineering Self-Organising Systems (ESOA), 2006, doi:10.1007/978-3-540-69868-5

like image 45
Vebjorn Ljosa Avatar answered Sep 21 '22 06:09

Vebjorn Ljosa