Gensim Word2Vec: poor training performance.

This might actually be a dumb question, but I just can't figure out why my script using gensim.models.word2vec isn't working. Here is the thing: I'm using the Stanford Sentiment Treebank dataset (~11,000 reviews), and I'm trying to build a Word2Vec model with gensim. This is my script:

import gensim as gs 
import sys 

# load the corpus (one sentence per line)
sentences = gs.models.word2vec.LineSentence('../processedWords.txt')
print("size in RAM of the sentences: {}".format(sys.getsizeof(sentences)))

# transform them
# bigram_transformer = gs.models.Phrases(sentences)

model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5)
model.save('firstModel')
print(model.similarity('film', 'test'))
print(model.similarity('film', 'movie'))

Now, my problem is that the script runs in about 2 seconds and reports implausibly high similarity between every pair of words. In addition, some words that appear in the sentences are missing from the built vocabulary.
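To illustrate the vocabulary part: as I understand it, min_count=10 drops every word that occurs fewer than 10 times in the corpus. Here is a rough stdlib sketch of that filtering as I understand it (toy data, not gensim's actual code):

```python
from collections import Counter

# Toy corpus standing in for processedWords.txt (made-up data).
sentences = [
    ["the", "film", "was", "great"],
    ["the", "movie", "was", "fine"],
    ["the", "film", "was", "a", "test"],
]

min_count = 2  # words rarer than this get dropped from the vocabulary

counts = Counter(word for sent in sentences for word in sent)
vocab = {w for w, c in counts.items() if c >= min_count}

print(sorted(vocab))  # → ['film', 'the', 'was']; "movie" and "test" are dropped
```

So if that's the mechanism, words below the min_count threshold would simply never make it into the model, which might explain the missing words, but not the uniformly high similarities.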

I must be doing something obviously wrong, but can't figure what.

Thank you for your help.

Tyrannas asked Feb 20 '26 01:02

1 Answer

I'm almost certain this is because you haven't specified the number of training iterations; I think iter defaults to 1, which is basically useless for training a neural net. Add iter=<int> to your model declaration, e.g. model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5, iter=1000). Kind of a face-palmer, but I did the exact same thing.

Spencer Norris answered Feb 21 '26 16:02