I've generated a 100D word2vec model using my domain text corpus, merging common phrases, for example (good bye => good_bye). Then I've extracted 1000 vectors of desired words.
So I have a 1000 numpy.array like so:
[[-0.050378,0.855622,1.107467,0.456601,...[100 dimensions],
[-0.040378,0.755622,1.107467,0.456601,...[100 dimensions],
...
...[1000 Vectors]
]
And words array like so:
["hello","hi","bye","good_bye"...1000]
I have ran K-Means on my data, and the results I got made sense:
X = np.array(words_vectors)
kmeans = KMeans(n_clusters=20, random_state=0).fit(X)
for idx,l in enumerate(kmeans.labels_):
print(l,words[idx])
--- Output ---
0 hello
0 hi
1 bye
1 good_bye
0 = greeting 1 = farewell
However, some words made me think that hierarchical clustering is more suitable for the task. I've tried using AgglomerativeClustering, Unfortunately ... for this Python nobee, things got complicated and I got lost.
How can I cluster my vectors, so the output would be a dendrogram, more or less, like the one found on this wiki page?
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.
The key to interpreting a hierarchical cluster analysis is to look at the point at which any given pair of cards “join together” in the tree diagram. Cards that join together sooner are more similar to each other than those that join together later.
I had the same problem till now! After finding always your post after searching it online (keyword = hierarchy clustering on word2vec). I had to give you a perhaps valid solution.
sentences = ['hi', 'hello', 'hi hello', 'goodbye', 'bye', 'goodbye bye']
sentences_split = [s.lower().split(' ') for s in sentences]
import gensim
model = gensim.models.Word2Vec(sentences_split, min_count=2)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
l = linkage(model.wv.syn0, method='complete', metric='seuclidean')
# calculate full dendrogram
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('word')
plt.xlabel('distance')
dendrogram(
l,
leaf_rotation=90., # rotates the x axis labels
leaf_font_size=16., # font size for the x axis labels
orientation='left',
leaf_label_func=lambda v: str(model.wv.index2word[v])
)
plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With