Python - Calculate Hierarchical clustering of word2vec vectors and plot the results as a dendrogram

I've generated a 100D word2vec model from my domain text corpus, merging common phrases (for example, good bye => good_bye). I then extracted the vectors for 1000 words of interest.

So I have a NumPy array of 1000 vectors, like so:

[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
 [-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
 ...
 ...[1000 Vectors]
]

And words array like so:

["hello","hi","bye","good_bye"...1000]

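I ran K-Means on my data, and the results I got made sense:

import numpy as np
from sklearn.cluster import KMeans

X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
# print each word next to the cluster it was assigned to
for idx, label in enumerate(kmeans.labels_):
    print(label, words[idx])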

--- Output ---
0 hello
0 hi
1 bye
1 good_bye

0 = greeting, 1 = farewell

However, some words made me think that hierarchical clustering is more suitable for the task. I've tried using AgglomerativeClustering. Unfortunately, for this Python newbie, things got complicated and I got lost.

How can I cluster my vectors so that the output is a dendrogram, more or less like the one found on this wiki page?

asked Jan 04 '17 11:01

Shlomi Schwartz

1 Answer

I had the same problem until now. Your post kept coming up whenever I searched online (keywords: hierarchical clustering on word2vec), so I felt I had to offer a solution that may work.

import gensim
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# toy corpus: each sentence becomes a list of lowercase tokens
sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]

# min_count=2 keeps only words that occur at least twice
model = gensim.models.Word2Vec(sentences_split, min_count=2)

# gensim 4.x names; older releases exposed these as
# model.wv.syn0 and model.wv.index2word
vectors = model.wv.vectors
vocab = model.wv.index_to_key

linkage_matrix = linkage(vectors, method='complete', metric='seuclidean')

# plot the full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')

dendrogram(
    linkage_matrix,
    leaf_rotation=90.,  # rotation of the leaf labels, in degrees
    leaf_font_size=16.,  # font size for the leaf labels
    orientation='left',  # root on the left, words along the y axis
    leaf_label_func=lambda v: str(vocab[v])  # map leaf index to its word
)
plt.show()
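
The example above clusters the model's whole vocabulary. If you already have the words list and the array of vectors from the question, the same SciPy calls can be applied to them directly. Below is a minimal sketch under a few assumptions of mine: the stand-in data replaces your real 1000 x 100 array, the average linkage with cosine distance is just one common choice for word embeddings (not something the question requires), and plot_dendrogram is a hypothetical helper name. It also uses fcluster to cut the tree into flat clusters, which gives labels you can compare with the K-Means output.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Stand-in data (hypothetical): replace with your own `words` list and the
# matching (n_words, 100) array of word2vec vectors.
rng = np.random.default_rng(0)
words = ["hello", "hi", "bye", "good_bye"]
greeting = rng.normal(size=100)
farewell = rng.normal(size=100)
words_vectors = np.array([
    greeting + 0.05 * rng.normal(size=100),
    greeting + 0.05 * rng.normal(size=100),
    farewell + 0.05 * rng.normal(size=100),
    farewell + 0.05 * rng.normal(size=100),
])

# Build the merge tree: average linkage on pairwise cosine distances.
Z = linkage(words_vectors, method="average", metric="cosine")

# Cut the tree into at most 2 flat clusters and print one label per word.
flat_labels = fcluster(Z, t=2, criterion="maxclust")
for label, word in zip(flat_labels, words):
    print(label, word)


def plot_dendrogram(linkage_matrix, leaf_labels):
    """Draw the tree with one labelled leaf per word."""
    # Imported here so the clustering above runs without a display.
    from matplotlib import pyplot as plt

    plt.figure(figsize=(10, 6))
    plt.title("Hierarchical Clustering Dendrogram")
    plt.xlabel("distance")
    plt.ylabel("word")
    dendrogram(
        linkage_matrix,
        labels=leaf_labels,   # word shown at each leaf
        orientation="left",   # root on the left, words along the y axis
        leaf_font_size=12,
    )
    plt.tight_layout()
    plt.show()
```

Calling plot_dendrogram(Z, words) then draws the figure. With 1000 words the plot gets crowded, so you may want a taller figure or the truncate_mode and p arguments of dendrogram to show only the top of the tree.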

answered Sep 20 '22 17:09

Antoine Reinhold Bertrand

Tags:

python

machine-learning

numpy

word2vec

hierarchical-clustering
