
How to do Text classification using word2vec

I want to perform text classification using word2vec. So far I have obtained vectors for the individual words:

import numpy as np
from gensim.models import Word2Vec

# split the raw text (stored in the variable 'lines') into tokenized sentences
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())

model = Word2Vec(ls, min_count=1, size=4)

# collect the vector of every word in the vocabulary
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())  # model.wv[...] avoids the deprecated model[...] access
data = np.array(vectors)
data

output:

array([[ 0.00933912,  0.07960335, -0.04559333,  0.10600036],
       [ 0.10576613,  0.07267512, -0.10718666, -0.00804013],
       [ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
       [-0.09893986,  0.01500741, -0.04796079, -0.04447284],
       [ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
       [ 0.09352681, -0.03864434, -0.01743148,  0.11251986],.....])

How can I perform classification (product vs. non-product)?

asked Apr 04 '18 by Shubham Agrawal

2 Answers

You already have the array of word vectors in model.wv.syn0. If you print it, you will see one row per vocabulary word, each row being that word's vector.
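
For a quick check (assuming the gensim < 4.0 API used throughout this page):

print(model.wv.syn0.shape)  # (vocabulary_size, vector_size)
print(model.wv.syn0[0])     # the vector of the first vocabulary word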

Below is a fuller example using Python 3:

import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression


#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
#getting only the first 4 columns of the file 
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)
  
# Create an array of tokens using nltk
# (word_tokenize needs the 'punkt' tokenizer data: nl.download('punkt'))
tokens = [nl.word_tokenize(sentence) for sentence in train]

Now it's time to train the word2vec model; in this example we will then fit a LogisticRegression on top of the word vectors.

# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
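# e.g. model.save("word2vec.model"), and reload it later with gensim.models.Word2Vec.load("word2vec.model")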

# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# get the last character of each line. In this case it is the department number,
# which serves as the class label (e.g. 0 or 1). (To use words as labels you
# would need to extract them differently; this approach only works for digits.)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            Y_dataset.append(int(lastchar))
        else:
            # fallback label, appended so Y_dataset stays aligned with the rows
            Y_dataset.append(40)


clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])

# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)

You can also compute the similarity between words in your model's vocabulary:

print("\n\nSimilarity value : ", model.wv.similarity('women', 'men'))

You can find more functions to use in the gensim documentation.

answered Sep 20 '22 by Joel Carneiro

Your question is rather broad, but I will try to give you a first approach to classifying text documents.

First of all, I would decide how to represent each document as one vector. You need a method that takes a list of word vectors and returns one single vector; it should avoid letting the length of the document influence what the vector represents. The mean, for example, is a good choice.

def document_vector(array_of_word_vectors):
    # averaging is length-independent: documents of any size map to one vector
    return array_of_word_vectors.mean(axis=0)

where array_of_word_vectors is, for example, data in your code.
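
For instance (a sketch that treats the data array from the question as the word vectors of one document):

doc_vec = document_vector(data)
doc_vec.shape  # (4,) -- one fixed-size vector, regardless of document length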

Now you can either play around with distances (cosine distance, for example, would be a nice first choice) and see how far certain documents are from each other, or - and that's probably the approach that brings faster results - use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example Logistic Regression.

The document vectors become your matrix X, and your vector y is an array of 1s and 0s, depending on the binary category (here: product vs. non-product) that you want the documents to be classified into. A sketch of both ideas follows.
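
A minimal sketch, assuming docs is a list of tokenized documents, model is the trained Word2Vec model from the question, and labels is your array of 0/1 product flags (all three names are placeholders for your own data):

import numpy as np
from scipy.spatial.distance import cosine
from sklearn.linear_model import LogisticRegression

# one mean vector per document, skipping words missing from the vocabulary
X = np.array([
    document_vector(np.array([model.wv[w] for w in doc if w in model.wv.vocab]))
    for doc in docs
])

# cosine distance between the first two documents (0 means same direction)
print(cosine(X[0], X[1]))

# binary classification on the document vectors
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X[:5]))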

answered Sep 22 '22 by Jérôme Bau