This might actually be a dumb question, but I just can't figure out why my script with gensim.models.word2vec is not working. Here is the thing: I'm using the Stanford Sentiment Treebank dataset (~11,000 reviews), and I'm trying to build a word2vec model with gensim. This is my script:
import gensim as gs
import sys
# load the sentences, one per line
sentences = gs.models.word2vec.LineSentence('../processedWords.txt')
print("size in RAM of the sentences: {}".format(sys.getsizeof(sentences)))
# transform them
# bigram_transformer = gs.models.Phrases(sentences)
model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5)
model.save('firstModel')
print(model.similarity('film', 'test'))
print(model.similarity('film', 'movie'))
Now, my problem is that the script finishes in about 2 seconds and reports a very high similarity for every pair of words. In addition, some words that appear in the sentences are missing from the built vocabulary.
I must be doing something obviously wrong, but I can't figure out what.
Thank you for your help.
I'm almost certain this is because you haven't specified a number of training iterations; I think iter defaults to 1 here, which is basically useless for training a neural net. Add the iter=<int> argument to your model declaration, e.g. model = gs.models.word2vec.Word2Vec(sentences, min_count=10, size=100, window=5, iter=1000).
Kind of a face-palmer but I did the same exact thing.