I have a large (106x106)
correlation matrix in pandas with the following structure:
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
| 0 | 1.0 | 0.465539925807 | 0.736955649673 | 0.733077703346 | -0.177380436347 | -0.268022641963 | 0.0642473239514 | -0.0136866435594 | -0.025596700815 | -0.00385065532308 |
| 1 | 0.465539925807 | 1.0 | -0.173472213691 | -0.16898620433 | -0.0460674481563 | 0.0994673318696 | 0.137137216943 | 0.061999118034 | 0.0944808695878 | 0.0229095105328 |
| 2 | 0.736955649673 | -0.173472213691 | 1.0 | 0.996627003263 | -0.172683935315 | -0.33319698831 | -0.0562591684255 | -0.0306820050477 | -0.0657065745626 | -0.0457836647012 |
| 3 | 0.733077703346 | -0.16898620433 | 0.996627003263 | 1.0 | -0.153606414649 | -0.321562257834 | -0.0465540370732 | -0.0224318843281 | -0.0586629098513 | -0.0417237678539 |
| 4 | -0.177380436347 | -0.0460674481563 | -0.172683935315 | -0.153606414649 | 1.0 | 0.0148395123941 | 0.191615549534 | 0.289211355855 | 0.28799868259 | 0.291523969899 |
| 5 | -0.268022641963 | 0.0994673318696 | -0.33319698831 | -0.321562257834 | 0.0148395123941 | 1.0 | 0.205432455075 | 0.445668299971 | 0.454982398693 | 0.427323555674 |
| 6 | 0.0642473239514 | 0.137137216943 | -0.0562591684255 | -0.0465540370732 | 0.191615549534 | 0.205432455075 | 1.0 | 0.674329392219 | 0.727261969241 | 0.67891326835 |
| 7 | -0.0136866435594 | 0.061999118034 | -0.0306820050477 | -0.0224318843281 | 0.289211355855 | 0.445668299971 | 0.674329392219 | 1.0 | 0.980543049288 | 0.939548790275 |
| 8 | -0.025596700815 | 0.0944808695878 | -0.0657065745626 | -0.0586629098513 | 0.28799868259 | 0.454982398693 | 0.727261969241 | 0.980543049288 | 1.0 | 0.930281915882 |
| 9 | -0.00385065532308 | 0.0229095105328 | -0.0457836647012 | -0.0417237678539 | 0.291523969899 | 0.427323555674 | 0.67891326835 | 0.939548790275 | 0.930281915882 | 1.0 |
+---+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+-------------------+
Truncated here for simplicity.
If I calculate the linkage and later plot the dendrogram using the following code:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(result_df.corr(), 'average')
fig, axes = plt.subplots(1, 1, figsize=(20, 20))
axes.tick_params(axis='both', which='major', labelsize=15)
dendrogram(Z=Z, labels=result_df_null_cols.columns,
           leaf_rotation=90., ax=axes,
           color_threshold=2.)
It yields a dendrogram like this: [dendrogram image: the y-axis extends far beyond 2]
My question concerns the y-axis. In all the examples I have seen, the y-axis is bounded between 0 and 2, which I have read to interpret as (1 - corr): 0 for items that are perfectly correlated (1 - 1 = 0) and 2 as the cutoff for perfectly anti-correlated items (1 - (-1) = 2). In my result, the upper boundary is much higher.
I found the following answer, but it does not agree with this answer and the lecture notes referenced here.
Anyway, I am hoping someone can clarify which source is correct and help spread some knowledge on the topic.
1) The y-axis is a measure of closeness of either individual data points or clusters. Then, these distances are used to compute the tree, using the following calculation between every pair of clusters.
There are two ways to interpret a dendrogram: in terms of large-scale groups or in terms of similarities among individual chunks. To identify large-scale groups, we start reading from the top down, finding the branch points that are at high levels in the structure.
The horizontal axis represents the clusters. The vertical scale on the dendrogram represents the distance or dissimilarity.
The key to interpreting a hierarchical cluster analysis is to look at the point at which any given pair of cards “join together” in the tree diagram. Cards that join together sooner are more similar to each other than those that join together later.
The metric used by linkage() here is the Euclidean distance between rows of the matrix you passed in (see here), not the correlation values themselves. Therefore the y-axis can go beyond 2; its range depends purely on the distance metric in use.
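For intuition, a minimal sketch (my own; n = 106 chosen to match your matrix width) of how far apart two rows with entries in [-1, 1] can get under Euclidean distance:
import numpy as np

n = 106
row_a = np.full(n, 1.0)   # extreme row: all entries +1
row_b = np.full(n, -1.0)  # extreme row: all entries -1

# worst case is 2 * sqrt(n), nowhere near the [0, 2] range
print(np.linalg.norm(row_a - row_b))  # ~20.59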
This supports the points mentioned in this answer:
1) The y-axis is a measure of closeness of either individual data points or clusters. Then, these distances are used to compute the tree, using the following calculation between every pair of clusters.
From the documentation, for method='average' (UPGMA) the distance between clusters u and v is

d(u, v) = sum_ij d(u[i], v[j]) / (|u| * |v|)

that is, the average of the distances between all pairs of points drawn from the two clusters.
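A quick sketch (my own) confirming that this formula is just the mean of all cross-cluster pairwise distances:
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
u = rng.random((2, 3))  # cluster u: 2 points in 3 dimensions
v = rng.random((3, 3))  # cluster v: 3 points in 3 dimensions

# d(u, v) = sum_ij d(u[i], v[j]) / (|u| * |v|)
manual = sum(np.linalg.norm(ui - vj) for ui in u for vj in v) / (len(u) * len(v))
print(np.isclose(manual, cdist(u, v).mean()))  # True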
In your sample, even though the individual values never leave (-1, +1), we still get a dendrogram whose heights climb well past 2 (image omitted here). To see why, compute the pairwise distances yourself:
from scipy.spatial import distance

# pairwise Euclidean distances between the rows of the 10x10 correlation matrix
distance.pdist(df, 'euclidean')
The reason is that this condensed distance array has 45 entries (10 choose 2, one for every pair of columns; the ordering is explained here):
array([1.546726 , 0.79914141, 0.79426728, 2.24085106, 2.50838998,
2.22772899, 2.52578923, 2.55978527, 2.51553289, 2.11329023,
2.10501739, 1.66536963, 1.6303103 , 1.71821177, 2.04386712,
2.03917033, 2.03614219, 0.0280283 , 2.33440388, 2.68373496,
2.43771817, 2.68351612, 2.73148741, 2.66843754, 2.31758222,
2.67031469, 2.4206485 , 2.66539997, 2.7134241 , 2.65058045,
1.44756593, 1.39699605, 1.55063416, 1.56324546, 1.52001219,
1.32204039, 1.30206957, 1.29596715, 1.2895916 , 0.65145881,
0.62242858, 0.6283212 , 0.08642582, 0.11145739, 0.14420816])
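You can verify any of these entries by hand (assuming df holds the 10x10 correlation matrix shown above); note that the maximum, about 2.73, is already past 2:
import numpy as np
from scipy.spatial import distance

d = distance.pdist(df, 'euclidean')  # condensed vector of length 45
# first entry = Euclidean distance between rows 0 and 1
print(np.isclose(d[0], np.linalg.norm(df.values[0] - df.values[1])))  # True
print(distance.squareform(d)[0, 1])  # same value, redundant matrix form
print(d.max())                       # ~2.73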
If we build a random matrix of size (160, 160) with values drawn uniformly from (-1, 1), the dendrogram looks similar (image omitted): the heights sit far above 2 even though every value lies in (-1, 1).
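A sketch of that experiment (my own reconstruction of the omitted figure):
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
rand_mat = rng.uniform(-1, 1, size=(160, 160))  # all values confined to (-1, 1)

Z_rand = linkage(rand_mat, 'average')  # rows treated as observation vectors
print(Z_rand[:, 2].max())              # max merge height, far above 2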
Hence, the solution to your problem is to convert the correlation values into some form of distance measure.
One option is to use the same squareform() suggested in the other answer. This is a duct-tape approach to achieving the two defining properties of a distance measure: it must be zero between a point and itself, and non-negative for any two points. Both are satisfied by subtracting each correlation value from one.
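A minimal sketch of that route (assuming result_df is your original dataframe):
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage

corr = result_df.corr()
# 1 - corr has an exactly-zero diagonal and stays symmetric, so squareform()
# accepts it; pass checks=False if floating-point noise ever trips validation
dist_vec = distance.squareform(1 - corr.values)
Z = linkage(dist_vec, 'average')  # merge heights now live in [0, 2]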
Alternatively, we can use the distance.pdist function directly with 'correlation' as the metric (the implementation is available here). Remember to transpose the dataframe, because we need the correlation between each pair of columns, not rows.
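The two routes agree, since the 'correlation' metric computes 1 minus the Pearson correlation of each pair of vectors (a quick check, again assuming result_df):
import numpy as np
from scipy.spatial import distance

via_metric = distance.pdist(result_df.T, 'correlation')            # condensed 1 - corr
via_squareform = distance.squareform(1 - result_df.corr().values)  # same values
print(np.allclose(via_metric, via_squareform))  # True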
Example to understand the solution:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import distance
from scipy.cluster.hierarchy import dendrogram, linkage

size = (10000, 1)
col1 = np.random.randint(0, 100, size)               # base column
col2 = col1 * 0.9 + np.random.normal(0, 2, size)     # strong positive corr, small noise
col3 = col1 * 0.1 + np.random.normal(0, 100, size)   # essentially uncorrelated column
col4 = col1 * (-0.5) + np.random.normal(0, 1, size)  # strong negative corr

data = np.hstack((col1, col2, col3, col4))
df = pd.DataFrame(data, columns=list('ABCD'))
df.corr()
A B C D
A 1.000000 0.997042 0.029078 -0.997614
B 0.997042 1.000000 0.029233 -0.994677
C 0.029078 0.029233 1.000000 -0.028421
D -0.997614 -0.994677 -0.028421 1.000000
# pdist_values = distance.squareform(1 - df.corr().values)  # route 1: 1 - corr
pdist_values = distance.pdist(df.T, 'correlation')  # route 2: same distances, condensed
z = linkage(pdist_values, method='average')
dendrogram(z, labels=df.columns)
plt.show()
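With these distances in place, the merge heights land back in [0, 2], and you can cut the tree at a correlation threshold directly (a small usage sketch; note that under 1 - corr, strong negative correlation counts as maximal dissimilarity):
from scipy.cluster.hierarchy import fcluster

print(z[:, 2].max() <= 2.0)  # True: heights are based on 1 - corr

# cut at distance 0.1, i.e. average correlation above 0.9: A and B join,
# while C (uncorrelated) and D (anti-correlated, distance near 2) stay alone
print(fcluster(z, t=0.1, criterion='distance'))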