I have a large (106x106)
correlation matrix in pandas with the following structure:
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| 0 | 1.0 | 0.465539925807 | 0.736955649673 | 0.733077703346 | -0.177380436347 | -0.268022641963 | 0.0642473239514 | -0.0136866435594 | -0.025596700815 | -0.00385065532308 |
| 1 | 0.465539925807 | 1.0 | -0.173472213691 | -0.16898620433 | -0.0460674481563 | 0.0994673318696 | 0.137137216943 | 0.061999118034 | 0.0944808695878 | 0.0229095105328 |
| 2 | 0.736955649673 | -0.173472213691 | 1.0 | 0.996627003263 | -0.172683935315 | -0.33319698831 | -0.0562591684255 | -0.0306820050477 | -0.0657065745626 | -0.0457836647012 |
| 3 | 0.733077703346 | -0.16898620433 | 0.996627003263 | 1.0 | -0.153606414649 | -0.321562257834 | -0.0465540370732 | -0.0224318843281 | -0.0586629098513 | -0.0417237678539 |
| 4 | -0.177380436347 | -0.0460674481563 | -0.172683935315 | -0.153606414649 | 1.0 | 0.0148395123941 | 0.191615549534 | 0.289211355855 | 0.28799868259 | 0.291523969899 |
| 5 | -0.268022641963 | 0.0994673318696 | -0.33319698831 | -0.321562257834 | 0.0148395123941 | 1.0 | 0.205432455075 | 0.445668299971 | 0.454982398693 | 0.427323555674 |
| 6 | 0.0642473239514 | 0.137137216943 | -0.0562591684255 | -0.0465540370732 | 0.191615549534 | 0.205432455075 | 1.0 | 0.674329392219 | 0.727261969241 | 0.67891326835 |
| 7 | -0.0136866435594 | 0.061999118034 | -0.0306820050477 | -0.0224318843281 | 0.289211355855 | 0.445668299971 | 0.674329392219 | 1.0 | 0.980543049288 | 0.939548790275 |
| 8 | -0.025596700815 | 0.0944808695878 | -0.0657065745626 | -0.0586629098513 | 0.28799868259 | 0.454982398693 | 0.727261969241 | 0.980543049288 | 1.0 | 0.930281915882 |
| 9 | -0.00385065532308 | 0.0229095105328 | -0.0457836647012 | -0.0417237678539 | 0.291523969899 | 0.427323555674 | 0.67891326835 | 0.939548790275 | 0.930281915882 | 1.0 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
Truncated here for simplicity.
If I calculate the linkage and later plot the dendrogram using the following code:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(result_df.corr(), 'average')
fig, axes = plt.subplots(1, 1, figsize=(20, 20))
axes.tick_params(axis='both', which='major', labelsize=15)
dendrogram(Z=Z, labels=result_df_null_cols.columns,
           leaf_rotation=90., ax=axes,
           color_threshold=2.)
It yields a dendrogram like this: [dendrogram image: the y-axis extends far beyond 2]
My question concerns the y-axis. In all the examples I have seen, the y-axis is bounded between 0 and 2, which I have read to interpret as (1 - corr): 0 for items that are perfectly correlated (1 - 1 = 0) and 2 as the cutoff for perfectly anti-correlated items (1 - (-1) = 2). In my result, the upper boundary is much higher.
I found the following answer, but it does not agree with this answer and the lecture notes referenced here.
Anyway, I am hoping someone can clarify which source is correct and help spread some knowledge on the topic.
1) The y-axis is a measure of closeness of either individual data points or clusters. Then, these distances are used to compute the tree, using the following calculation between every pair of clusters.
There are two ways to interpret a dendrogram: in terms of large-scale groups or in terms of similarities among individual chunks. To identify large-scale groups, we start reading from the top down, finding the branch points that are at high levels in the structure.
The horizontal axis represents the clusters. The vertical scale on the dendrogram represents the distance or dissimilarity.
The key to interpreting a hierarchical cluster analysis is to look at the point at which any given pair of cards “join together” in the tree diagram. Cards that join together sooner are more similar to each other than those that join together later.
The metric used by linkage() here is the Euclidean distance between rows of the matrix you passed in (see here), not the correlation values themselves. Therefore the y-axis can go beyond 2; its range depends purely on the distance metric in use.
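For intuition, a minimal sketch (my own; n = 106 chosen to match your matrix width) of how far apart two rows with entries in [-1, 1] can get under Euclidean distance:
import numpy as np

n = 106
row_a = np.full(n, 1.0)   # extreme row: all entries +1
row_b = np.full(n, -1.0)  # extreme row: all entries -1

# worst case is 2 * sqrt(n), nowhere near the [0, 2] range
print(np.linalg.norm(row_a - row_b))  # ~20.59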
This supports the points mentioned in this answer:
1) The y-axis is a measure of closeness of either individual data points or clusters. Then, these distances are used to compute the tree, using the following calculation between every pair of clusters.
From the documentation, for method='average' (UPGMA) the distance between clusters u and v is

d(u, v) = sum_ij d(u[i], v[j]) / (|u| * |v|)

that is, the average of the distances between all pairs of points drawn from the two clusters.
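A quick sketch (my own) confirming that this formula is just the mean of all cross-cluster pairwise distances:
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
u = rng.random((2, 3))  # cluster u: 2 points in 3 dimensions
v = rng.random((3, 3))  # cluster v: 3 points in 3 dimensions

# d(u, v) = sum_ij d(u[i], v[j]) / (|u| * |v|)
manual = sum(np.linalg.norm(ui - vj) for ui in u for vj in v) / (len(u) * len(v))
print(np.isclose(manual, cdist(u, v).mean()))  # True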
In your sample, even though the individual values never leave (-1, +1), we still get a dendrogram whose heights climb well past 2 (image omitted here). To see why, compute the pairwise distances yourself:
from scipy.spatial import distance

# pairwise Euclidean distances between the rows of the 10x10 correlation matrix
distance.pdist(df, 'euclidean')
The reason is that this condensed distance array has 45 entries (10 choose 2, one for every pair of columns; the ordering is explained here):
array([1.546726 , 0.79914141, 0.79426728, 2.24085106, 2.50838998,
2.22772899, 2.52578923, 2.55978527, 2.51553289, 2.11329023,
2.10501739, 1.66536963, 1.6303103 , 1.71821177, 2.04386712,
2.03917033, 2.03614219, 0.0280283 , 2.33440388, 2.68373496,
2.43771817, 2.68351612, 2.73148741, 2.66843754, 2.31758222,
2.67031469, 2.4206485 , 2.66539997, 2.7134241 , 2.65058045,
1.44756593, 1.39699605, 1.55063416, 1.56324546, 1.52001219,
1.32204039, 1.30206957, 1.29596715, 1.2895916 , 0.65145881,
0.62242858, 0.6283212 , 0.08642582, 0.11145739, 0.14420816])
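You can verify any of these entries by hand (assuming df holds the 10x10 correlation matrix shown above); note that the maximum, about 2.73, is already past 2:
import numpy as np
from scipy.spatial import distance

d = distance.pdist(df, 'euclidean')  # condensed vector of length 45
# first entry = Euclidean distance between rows 0 and 1
print(np.isclose(d[0], np.linalg.norm(df.values[0] - df.values[1])))  # True
print(distance.squareform(d)[0, 1])  # same value, redundant matrix form
print(d.max())                       # ~2.73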
If we build a random matrix of size (160, 160) with values drawn uniformly from (-1, 1), the dendrogram looks similar (image omitted): the heights sit far above 2 even though every value lies in (-1, 1).
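A sketch of that experiment (my own reconstruction of the omitted figure):
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
rand_mat = rng.uniform(-1, 1, size=(160, 160))  # all values confined to (-1, 1)

Z_rand = linkage(rand_mat, 'average')  # rows treated as observation vectors
print(Z_rand[:, 2].max())              # max merge height, far above 2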
Hence, the solution to your problem is to convert the correlation values into some form of distance measure.
One option is to use the same squareform() suggested in the other answer. This is a duct-tape approach to achieving the two defining properties of a distance measure: it must be zero between a point and itself, and non-negative for any two points. Both are satisfied by subtracting each correlation value from one.
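A minimal sketch of that route (assuming result_df is your original dataframe):
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage

corr = result_df.corr()
# 1 - corr has an exactly-zero diagonal and stays symmetric, so squareform()
# accepts it; pass checks=False if floating-point noise ever trips validation
dist_vec = distance.squareform(1 - corr.values)
Z = linkage(dist_vec, 'average')  # merge heights now live in [0, 2]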
Alternatively, we can use the distance.pdist function directly with 'correlation' as the metric (the implementation is available here). Remember to transpose the dataframe, because we need the correlation between each pair of columns, not rows.
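The two routes agree, since the 'correlation' metric computes 1 minus the Pearson correlation of each pair of vectors (a quick check, again assuming result_df):
import numpy as np
from scipy.spatial import distance

via_metric = distance.pdist(result_df.T, 'correlation')            # condensed 1 - corr
via_squareform = distance.squareform(1 - result_df.corr().values)  # same values
print(np.allclose(via_metric, via_squareform))  # True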
Example to understand the solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import distance
from scipy.cluster.hierarchy import dendrogram, linkage

size = (10000, 1)
col1 = np.random.randint(0, 100, size)               # base column
col2 = col1 * 0.9 + np.random.normal(0, 2, size)     # strong positive corr, small noise
col3 = col1 * 0.1 + np.random.normal(0, 100, size)   # essentially uncorrelated column
col4 = col1 * (-0.5) + np.random.normal(0, 1, size)  # strong negative corr

data = np.hstack((col1, col2, col3, col4))
df = pd.DataFrame(data, columns=list('ABCD'))
df.corr()
A B C D
A 1.000000 0.997042 0.029078 -0.997614
B 0.997042 1.000000 0.029233 -0.994677
C 0.029078 0.029233 1.000000 -0.028421
D -0.997614 -0.994677 -0.028421 1.000000
# pdist_values = distance.squareform(1 - df.corr().values)  # route 1: 1 - corr
pdist_values = distance.pdist(df.T, 'correlation')  # route 2: same distances, condensed
z = linkage(pdist_values, method='average')
dendrogram(z, labels=df.columns)
plt.show()
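With these distances in place, the merge heights land back in [0, 2], and you can cut the tree at a correlation threshold directly (a small usage sketch; note that under 1 - corr, strong negative correlation counts as maximal dissimilarity):
from scipy.cluster.hierarchy import fcluster

print(z[:, 2].max() <= 2.0)  # True: heights are based on 1 - corr

# cut at distance 0.1, i.e. average correlation above 0.9: A and B join,
# while C (uncorrelated) and D (anti-correlated, distance near 2) stay alone
print(fcluster(z, t=0.1, criterion='distance'))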