This might actually be a dumb question, but I just can't figure out why my script with gensim.models.word2vec is not working. Here is the thing: I'm using the Stanford Sentiment Treebank dataset (~11,000 reviews), and I'm trying to build a word2vec model with gensim. This is my script:
import gensim as gs
import sys
# load the sentences, one per line
sentences = gs.models.word2vec.LineSentence('../processedWords.txt')
print("size in RAM of the sentences: {}".format(sys.getsizeof(sentences)))
# transform them
# bigram_transformer = gs.models.Phrases(sentences)
model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5)
model.save('firstModel')
print(model.similarity('film', 'test'))
print(model.similarity('film', 'movie'))
Now, my problem is that the script finishes in about 2 seconds and reports a very high similarity for every pair of words. In addition, some words that appear in the sentences are missing from the built vocabulary.
I must be doing something obviously wrong, but I can't figure out what.
Thank you for your help.
I'm almost certain this is because you haven't specified a number of training iterations; I think iter defaults to 1 here, which is basically useless for training a neural net. Add the iter=<int> argument to your model declaration, e.g. model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5, iter=1000).
Kind of a face-palmer but I did the same exact thing.