
How to load sentences into Python gensim?

Tags: python, nlp, gensim

I am trying to use the word2vec module from the gensim natural language processing library in Python.

The docs say to initialize the model:

from gensim.models import word2vec
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)

What format does gensim expect for the input sentences? I have raw text:

"the quick brown fox jumps over the lazy dogs"
"Then a cop quizzed Mick Jagger's ex-wives briefly."
etc.

What additional processing do I need to do before passing it to word2vec?


UPDATE: Here is what I have tried. After building the vocabulary from the sentences, it comes out empty.

>>> sentences = ['the quick brown fox jumps over the lazy dogs',
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
>>> x = word2vec.Word2Vec()
>>> x.build_vocab([s.encode('utf-8').split( ) for s in sentences])
>>> x.vocab
{}
john mangual asked Dec 03 '13 22:12


1 Answer

A list of utf-8 sentences, where each sentence is itself a list of words. You can also stream the data from disk.
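For the streaming case, here is a minimal sketch of a restartable iterable that reads one sentence per line from a text file. The class name `LineSentences` and the file name `corpus.txt` are made up for illustration, and it assumes whitespace tokenization is good enough for your text. It re-opens the file on each pass because, as far as I know, the model iterates over the corpus more than once, so a one-shot generator would not do.

```python
import io


class LineSentences:
    """Restartable iterable yielding one tokenized sentence per line (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with io.open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()  # naive whitespace tokenization
                if tokens:  # skip blank lines
                    yield tokens


# Assumed usage (not run here): pass the iterable where the list would go, e.g.
# word2vec.Word2Vec(LineSentences("corpus.txt"), ...)
```

Gensim also ships its own `LineSentence` class in `gensim.models.word2vec` for this one-sentence-per-line layout, so the sketch above is mainly useful if you need custom tokenization.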

Make sure the text is utf-8, and split each sentence into words:

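from gensim.models import word2vec

sentences = ["the quick brown fox jumps over the lazy dogs",
             "Then a cop quizzed Mick Jagger's ex-wives briefly."]
# min_count=1 keeps every word; with min_count=5 nothing in a corpus this small would survive.
model = word2vec.Word2Vec([s.encode('utf-8').split() for s in sentences], size=100, window=5, min_count=1, workers=4)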
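The empty `x.vocab` in the question is most likely down to `min_count` rather than the input format: Word2Vec discards words that occur fewer than `min_count` times, and the default is 5. The sketch below only mimics that frequency filter with the standard library so you can see the effect on the two example sentences. `surviving_vocab` is a hypothetical helper, not gensim code.

```python
from collections import Counter

sentences = [
    "the quick brown fox jumps over the lazy dogs",
    "Then a cop quizzed Mick Jagger's ex-wives briefly.",
]

# Split each sentence on whitespace to get one list of tokens per sentence.
tokenized = [s.split() for s in sentences]

# Count how often each token occurs across the whole corpus.
counts = Counter(token for sentence in tokenized for token in sentence)


def surviving_vocab(counts, min_count):
    """Tokens occurring at least min_count times (hypothetical helper, not gensim code)."""
    return {token for token, n in counts.items() if n >= min_count}


print(counts["the"])                              # 2, the most frequent token
print(len(surviving_vocab(counts, min_count=5)))  # 0, nothing reaches 5 occurrences
print(len(surviving_vocab(counts, min_count=1)))  # 16, every distinct token is kept
```

So for a toy corpus, pass `min_count=1` (or train on far more text) if you want a non-empty vocabulary.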
aIKid answered Oct 07 '22 00:10