I have a semi-structured dataset, each row pertains to a single user:
id, skills
0,"java, python, sql"
1,"java, python, spark, html"
2, "business management, communication"
Why semi-structured is because the followings skills can only be selected from a list of 580 unique values.
My goal is to cluster users, or find similar users based on similar skillsets. I have tried using a Word2Vec model, which gives me very good results to identify similar skillsets - For eg.
model.most_similar(["Data Science"])
gives me -
[('Data Mining', 0.9249375462532043),
('Data Visualization', 0.9111810922622681),
('Big Data', 0.8253220319747925),...
This gives me a very good model for identifying individual skills and not group of skills. how do I make use of the vector provided from the Word2Vec model to successfully cluster groups of similar users?
You need to vectorize you strings using your Word2Vec model. You can make it possible like this:
model = KeyedVectors.load("path/to/your/model")
w2v_vectors = model.wv.vectors # here you load vectors for each word in your model
w2v_indices = {word: model.wv.vocab[word].index for word in model.wv.vocab} # here you load indices - with whom you can find an index of the particular word in your model
Then you can use is in this way:
def vectorize(line):
words = []
for word in line: # line - iterable, for example list of tokens
try:
w2v_idx = w2v_indices[word]
except KeyError: # if you does not have a vector for this word in your w2v model, continue
continue
words.append(w2v_vectors[w2v_idx])
if words:
words = np.asarray(words)
min_vec = words.min(axis=0)
max_vec = words.max(axis=0)
return np.concatenate((min_vec, max_vec))
if not words:
return None
Then you receive a vector, which represents your line (document, etc).
After you received all your vectors for each of the lines, you need to cluster, you can use DBSCAN from sklearn for clustering.
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(metric='cosine', eps=0.07, min_samples=3) # you can change these parameters, given just for example
cluster_labels = dbscan.fit_predict(X) # where X - is your matrix, where each row corresponds to one document (line) from the docs, you need to cluster
Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With