I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use the scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how the output of this function is formatted. I was wondering whether someone out there can enlighten me.
This version tracks the distance (size) of each cluster (node_id) and confirms the number of members. It uses the scipy linkage() function, which is the same foundation that agglomerative clusterers are built on.
A linkage matrix is valid if it is a 2-D array (type double) with n − 1 rows and 4 columns. The first two columns must contain cluster indices between 0 and 2n − 1. For a given row i, the following two expressions have to hold: 0 ≤ Z[i, 0] ≤ i + n − 1 and 0 ≤ Z[i, 1] ≤ i + n − 1.
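As a quick sanity check for those conditions, scipy ships is_valid_linkage. Here is a minimal sketch with a hand-built linkage matrix for a hypothetical n = 3 observations:

```python
import numpy as np
from scipy.cluster.hierarchy import is_valid_linkage

# Hand-built linkage matrix for n = 3 observations (made-up example):
# row 0: merge original points 0 and 1 at distance 0.5 -> new cluster 3 (2 members)
# row 1: merge point 2 with cluster 3 at distance 1.2 -> new cluster 4 (3 members)
Z = np.array([[0.0, 1.0, 0.5, 2.0],
              [2.0, 3.0, 1.2, 3.0]], dtype=np.double)

print(is_valid_linkage(Z))  # True: (n-1) x 4, double, indices in range
```

Passing a matrix that violates the index bounds (e.g. a cluster id used before the row that creates it) makes is_valid_linkage return False, which is a useful check before handing a hand-rolled matrix to dendrogram.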
A linkage function is an essential prerequisite for hierarchical cluster analysis. Its value is a measure of the "distance" between two groups of objects (i.e. between two clusters). Algorithms for hierarchical clustering normally differ in the linkage function used.
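For instance (a small sketch with made-up points), the single and complete linkage functions measure the distance between two groups as the closest and farthest pair of points, respectively:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
g1 = np.array([[0.0, 0.0], [1.0, 0.0]])
g2 = np.array([[3.0, 0.0], [5.0, 0.0]])

d = cdist(g1, g2)   # all pairwise distances between the two groups
print(d.min())      # single linkage: closest pair  -> 2.0
print(d.max())      # complete linkage: farthest pair -> 5.0
```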
These functions (in scipy.cluster.hierarchy) cut hierarchical clusterings into flat clusterings, or find the roots of the forest formed by a cut, by providing the flat cluster ids of each observation.
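As a sketch of that cutting step (with hypothetical toy data), fcluster turns a linkage matrix into flat cluster labels at a chosen distance threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy 1-D observations forming two obvious groups (made-up data)
x = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
Z = linkage(x, method="single")

# Cut the tree at distance 1.0: every merge below that height
# ends up inside the same flat cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

With this data the first three observations share one label and the last three another, since the only merges above the threshold are the ones joining the two groups.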
From the scipy docs for the linkage matrix: A (n − 1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the n original observations.
This is from the scipy.cluster.hierarchy.linkage() function documentation; I think it's a pretty clear description of the output format: A (n − 1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i.
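That rule can be made concrete with a short loop that prints each merge; a sketch using made-up data:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Four toy 2-D points (hypothetical data)
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
n = len(pts)
Z = hac.linkage(pts, method="single")

for i, (c0, c1, dist, size) in enumerate(Z):
    # Clusters with index < n are original observations;
    # the merge in row i creates new cluster n + i.
    print(f"step {i}: merge {int(c0)} and {int(c1)} "
          f"-> cluster {n + i} (distance {dist:.3f}, {int(size)} members)")
```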
I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, though I also agree with https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.
Using the example data from this question: Tutorial for scipy.cluster.hierarchy
import numpy as np
import scipy.cluster.hierarchy as hac
from numpy.linalg import norm

A = np.array([[0.1, 2.5],
              [1.5, 0.4],
              [0.3, 1.0],
              [1.0, 0.8],
              [0.5, 0.0],
              [0.0, 0.5],
              [0.5, 0.5],
              [2.7, 2.0],
              [2.2, 3.1],
              [3.0, 2.0],
              [3.2, 1.3]])
A linkage matrix can be built using the single method (i.e., merging on the closest pair of points between two clusters):
z = hac.linkage(A, method="single")

array([[ 7.        ,  9.        ,  0.3       ,  2.        ],
       [ 4.        ,  6.        ,  0.5       ,  2.        ],
       [ 5.        , 12.        ,  0.5       ,  3.        ],
       [ 2.        , 13.        ,  0.53851648,  4.        ],
       [ 3.        , 14.        ,  0.58309519,  5.        ],
       [ 1.        , 15.        ,  0.64031242,  6.        ],
       [10.        , 11.        ,  0.72801099,  3.        ],
       [ 8.        , 17.        ,  1.2083046 ,  4.        ],
       [ 0.        , 16.        ,  1.5132746 ,  7.        ],
       [18.        , 19.        ,  1.92353841, 11.        ]])
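With a matrix in this format, the dendrogram the question asks about is a single call. A sketch using no_plot=True, which returns the layout dict without requiring matplotlib:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Same example data as above
A = np.array([[0.1, 2.5], [1.5, 0.4], [0.3, 1.0], [1.0, 0.8],
              [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [2.7, 2.0],
              [2.2, 3.1], [3.0, 2.0], [3.2, 1.3]])
z = hac.linkage(A, method="single")

# no_plot=True skips drawing and just computes the tree layout
d = hac.dendrogram(z, no_plot=True)
print(d["ivl"])  # leaf labels in left-to-right plot order
```

Calling hac.dendrogram(z) without no_plot draws the figure on the current matplotlib axes; any 2-D double array that satisfies the validity rules above (e.g. from a hand-rolled clustering) can be passed the same way.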
As the documentation explains, clusters with an index below n (here: 11) are simply the data points in the original matrix A. The intermediate clusters that follow are indexed successively.
Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, and clusters 4 and 6 into cluster 12. Now observe the third row, which merges cluster 5 (a point from A) and cluster 12 (the intermediate cluster formed in the second row) with a merge distance of 0.5. The single method entails that the new distance is 0.5, which is the distance between A[5] and the closest point in cluster 12, which contains A[4] and A[6]. Let's check:
In [198]: norm(A[5] - A[4])
Out[198]: 0.70710678118654757

In [199]: norm(A[5] - A[6])
Out[199]: 0.5
This cluster should now be intermediate cluster 13, which is subsequently merged with A[2]. Thus, the new distance should be the smallest distance between point A[2] and any of A[4], A[5], A[6].
In [200]: norm(A[2] - A[4])
Out[200]: 1.019803902718557

In [201]: norm(A[2] - A[5])
Out[201]: 0.58309518948452999

In [202]: norm(A[2] - A[6])
Out[202]: 0.53851648071345048
Which, as can be seen, also checks out: 0.53851648 is exactly the distance recorded in the fourth row of the linkage matrix. This explains the indexing of the new intermediate clusters.
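To make the indexing fully explicit, the original observations inside any intermediate cluster can be recovered by recursing through the linkage matrix. A sketch using the same data as above (the helper name leaves_of is my own, not a scipy function):

```python
import numpy as np
import scipy.cluster.hierarchy as hac

A = np.array([[0.1, 2.5], [1.5, 0.4], [0.3, 1.0], [1.0, 0.8],
              [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [2.7, 2.0],
              [2.2, 3.1], [3.0, 2.0], [3.2, 1.3]])
n = len(A)
z = hac.linkage(A, method="single")

def leaves_of(cluster_id, Z, n):
    """Return the original observation indices contained in a cluster id."""
    if cluster_id < n:                       # an original observation
        return [cluster_id]
    row = Z[cluster_id - n]                  # the merge that created this cluster
    return leaves_of(int(row[0]), Z, n) + leaves_of(int(row[1]), Z, n)

print(sorted(leaves_of(13, z, n)))  # members of intermediate cluster 13
```

The length of each recovered member list should match the fourth column of the corresponding row of z, which is exactly the "number of members" bookkeeping described above.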