 

Python - Calculate Hierarchical clustering of word2vec vectors and plot the results as a dendrogram

I've generated a 100D word2vec model from my domain text corpus, merging common phrases (e.g. good bye => good_bye). Then I extracted 1000 vectors for the words I'm interested in.

So I have a numpy array of 1000 vectors, like so:

[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
 [-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
 ...
 ...[1000 Vectors]
]

And a words array like so:

["hello","hi","bye","good_bye"...1000]
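For reference, a minimal stand-in for these two structures (random vectors in place of the real word2vec output, since the corpus isn't shown; 4 words instead of 1000):

```python
import numpy as np

# Hypothetical stand-in: random 100-D vectors instead of real word2vec output
rng = np.random.default_rng(0)
words = ["hello", "hi", "bye", "good_bye"]          # would be 1000 words
words_vectors = rng.normal(size=(len(words), 100))  # shape (n_words, 100)

print(words_vectors.shape)  # (4, 100)
```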

I ran K-Means on my data, and the results I got made sense:

import numpy as np
from sklearn.cluster import KMeans

X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx, l in enumerate(kmeans.labels_):
    print(l, words[idx])

--- Output ---
0 hello
0 hi
1 bye
1 good_bye

0 = greeting 1 = farewell

However, some words made me think that hierarchical clustering is more suitable for the task. I tried using AgglomerativeClustering, but unfortunately, for this Python newbie, things got complicated and I got lost.
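For what it's worth, the flat-label part of AgglomerativeClustering is only a few lines; the sketch below uses random stand-in vectors since the real ones aren't shown. Note that scikit-learn's AgglomerativeClustering does not draw a dendrogram by itself, which is why the answer below switches to scipy:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in data: random 100-D vectors for 4 hypothetical words
rng = np.random.default_rng(0)
words = ["hello", "hi", "bye", "good_bye"]
X = rng.normal(size=(len(words), 100))

# Ward linkage on Euclidean distance; n_clusters=2 yields flat labels,
# analogous to the K-Means output above, but from a hierarchical merge tree
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
for label, word in zip(agg.labels_, words):
    print(label, word)
```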

How can I cluster my vectors so that the output is a dendrogram, more or less like the one found on this wiki page?

Shlomi Schwartz asked Jan 04 '17 11:01
People also ask

How is a dendrogram used in hierarchical clustering?

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.

How do you interpret the results of hierarchical clustering?

The key to interpreting a hierarchical cluster analysis is to look at the point at which any given pair of items "join together" in the tree diagram. Items that join together sooner are more similar to each other than those that join together later.
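That "join height" can also be read off programmatically: scipy's fcluster cuts the merge tree at a distance threshold. A small sketch with made-up 2-D points (not the asker's word vectors):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs far from each other: (0,0)/(0,1) and (10,10)/(10,11)
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Z = linkage(pts, method="complete")

# Each linkage row is [left, right, merge_distance, cluster_size];
# the two tight pairs merge at distance 1.0, long before the final merge
print(Z[:, 2])

# Cutting below the big merge distance recovers the two pairs
labels = fcluster(Z, t=5.0, criterion="distance")
print(labels)  # e.g. [1 1 2 2]
```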


1 Answer

I had the same problem until now! I kept landing on your post when searching online (keyword: hierarchical clustering on word2vec), so I wanted to share a possibly valid solution.

sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]

import gensim
model = gensim.models.Word2Vec(sentences_split, min_count=2)

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# model.wv.vectors was called model.wv.syn0 before gensim 4
l = linkage(model.wv.vectors, method='complete', metric='seuclidean')

# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    l,
    orientation='left',   # leaves (words) go on the y axis
    leaf_font_size=16.,   # font size for the leaf labels
    # model.wv.index_to_key was model.wv.index2word before gensim 4
    leaf_label_func=lambda v: str(model.wv.index_to_key[v])
)
plt.show()
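One tweak worth considering: word2vec similarity is usually measured with cosine similarity rather than (standardized) Euclidean distance, so passing metric='cosine' to linkage may group words more in line with model.wv.most_similar. A sketch on random stand-in vectors (ward requires Euclidean, so average linkage is used here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for model.wv.vectors: random 100-D word vectors
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 100))

# Average linkage on cosine distance (1 - cosine similarity)
Z = linkage(vectors, method="average", metric="cosine")

# Flat labels, forcing at most 3 clusters out of the tree
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```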
Antoine Reinhold Bertrand answered Sep 20 '22 17:09