Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scipy linkage format

I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how the output of this is formatted. I was wondering whether someone out there can enlighten me.

like image 531
geo_pythoncl Avatar asked Mar 23 '12 12:03

geo_pythoncl


People also ask

What does scipy linkage do?

This version tracks the distance (size) of each cluster (node_id), and confirms the number of members. This uses the scipy linkage() function which is the same foundation of the Aggregator clusterer.

What is the linkage matrix?

A linkage matrix is valid if it is a 2-D array (type double) with rows and 4 columns. The first two columns must contain indices between 0 and 2 n − 1 . For a given row i , the following two expressions have to hold: 0 ≤ Z [ i , 0 ] ≤ i + n − 1 0 ≤ Z [ i , 1 ] ≤ i + n − 1.

What is linkage Python?

A linkage function is an essential prerequisite for hierarchical cluster analysis . Its value is a measure of the "distance" between two groups of objects (i.e. between two clusters). Algorithms for hierarchical clustering normally differ by the linkage function used.

What is scipy cluster hierarchy?

cluster. hierarchy ) These functions cut hierarchical clusterings into flat clusterings or find the roots of the forest formed by a cut by providing the flat cluster ids of each observation.

How does the linkage matrix work in SciPy?

From the scipy docs for the linkage matrix: A (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z [i, 0] and Z [i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations.

What is the output format of the cluster linkage function?

This is from the scipy.cluster.hierarchy.linkage () function documentation, I think it's a pretty clear description for the output format: A ( n -1) by 4 matrix Z is returned. At the i -th iteration, clusters with indices Z [i, 0] and Z [i, 1] are combined to form cluster n + i.

What is the benefit of using SciPy library in Python for ML?

The benefit of using SciPy library in Python while making ML models is that it also makes a strong programming language available for use in developing less complex programs and applications. LU decomposition is a method that reduce matrix into constituent parts that helps in easier calculation of complex matrix operations.

What is the use of SciPy?

SciPy is an interactive Python session used as a data-processing library that is made to compete with its rivalries such as MATLAB, Octave, R-Lab,etc. It has many user-friendly, efficient and easy-to-use functions that helps to solve problems like numerical integration, interpolation, optimization, linear algebra and statistics.


1 Answers

I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, while I do agree with the https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.

Using the example data from this question: Tutorial for scipy.cluster.hierarchy

A = np.array([[0.1,   2.5],               [1.5,   .4 ],               [0.3,   1  ],               [1  ,   .8 ],               [0.5,   0  ],               [0  ,   0.5],               [0.5,   0.5],               [2.7,   2  ],               [2.2,   3.1],               [3  ,   2  ],               [3.2,   1.3]]) 

A linkage matrix can be built using the single (i.e, the closest matching points):

z = hac.linkage(a, method="single")   array([[  7.        ,   9.        ,   0.3       ,   2.        ],         [  4.        ,   6.        ,   0.5       ,   2.        ],         [  5.        ,  12.        ,   0.5       ,   3.        ],         [  2.        ,  13.        ,   0.53851648,   4.        ],         [  3.        ,  14.        ,   0.58309519,   5.        ],         [  1.        ,  15.        ,   0.64031242,   6.        ],         [ 10.        ,  11.        ,   0.72801099,   3.        ],         [  8.        ,  17.        ,   1.2083046 ,   4.        ],         [  0.        ,  16.        ,   1.5132746 ,   7.        ],         [ 18.        ,  19.        ,   1.92353841,  11.        ]]) 

As the documentation explains the clusters below n (here: 11) are simply the data points in the original matrix A. The intermediate clusters going forward, are indexed successively.

Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, clusters 4 and 6 into 12. Then observe line three, merging clusters 5 (from A) and 12 (from the not-shown intermediate cluster 12) resulting with a Within-Cluster Distance (WCD) of 0.5. The single method entails that the new WCS is 0.5, which is the distance between A[5] and the closest point in cluster 12, A[4] and A[6]. Let's check:

 In [198]: norm([a[5]-a[4]])  Out[198]: 0.70710678118654757  In [199]: norm([a[5]-a[6]])  Out[199]: 0.5 

This cluster should now be intermediate cluster 13, which subsequently is merged with A[2]. Thus, the new distance should be the closest between the points A[2] and A[4,5,6].

 In [200]: norm([a[2]-a[4]])  Out[200]: 1.019803902718557  In [201]: norm([a[2]-a[5]])  Out[201]: 0.58309518948452999  In [202]: norm([a[2]-a[6]])  Out[202]: 0.53851648071345048 

Which, as can be seen also checks out, and explains the intermediate format of new clusters.

like image 134
user1603472 Avatar answered Sep 25 '22 22:09

user1603472