How should I train gensim on the Brown corpus?

Tags:

python

gensim

I am trying to use gensim's word2vec, but I cannot train a model on the Brown corpus. Here is my code:

from gensim import models

model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
model.save("E:\\data.bin")

I downloaded nltk_data using nltk.download(), but I get the error below.

C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py:401: UserWarning: Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`
  warnings.warn("Cython compilation failed, training will be slow. Do you have Cython installed? `pip install cython`")
Traceback (most recent call last):
  File "E:\eclipse_workspace\Python_files\Test\Test.py", line 8, in <module>
    model = models.Word2Vec([sentence for sentence in models.word2vec.BrownCorpus("E:\\nltk_data\\")],workers=4)
  File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 276, in __init__
    self.train(sentences)
  File "C:\Python27\lib\site-packages\gensim-0.10.1-py2.7.egg\gensim\models\word2vec.py", line 407, in train
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

What am I doing wrong?

WannaBeCoder asked Dec 14 '22 17:12


2 Answers

You are probably building the sentences incorrectly.
Try this instead; it works for me:

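import gensim
import logging
from nltk.corpus import brown

# Optional: log training progress to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Tokenized Brown sentences (lists of words), read through NLTK.
sentences = brown.sents()
# min_count=1 keeps every word, including those that occur only once.
model = gensim.models.Word2Vec(sentences, min_count=1)
model.save('/tmp/brown_model')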

The logging part is optional, and you can change the parameters passed to Word2Vec() to suit your needs.
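
The RuntimeError in the question most likely means that the sentence list was empty, so no vocabulary could be built. A minimal sketch of a guard that catches this before training; the helper name `load_sentences` is hypothetical and not part of gensim or NLTK:

```python
def load_sentences(sentences):
    """Materialize `sentences` as a list and fail early if it is empty.

    Hypothetical helper for illustration; not part of gensim or NLTK.
    """
    materialized = list(sentences)
    if not materialized:
        raise ValueError("no sentences found; check the corpus path")
    return materialized


# An empty corpus is rejected with a message that points at the likely cause.
try:
    load_sentences([])
except ValueError as err:
    error_message = str(err)
    print(error_message)

# A non-empty corpus is returned as a plain list.
checked = load_sentences(iter([["the", "house"], ["a", "door"]]))
print(checked)
```

The returned list could then be passed to Word2Vec() in place of the raw iterable. Holding the whole corpus in a list is an assumption that suits a corpus as small as Brown; for much larger data a re-iterable object would be a better fit.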

JasonWayne answered Jan 02 '23 00:01


You need the full path to the directory that contains the Brown corpus files, not just the nltk_data directory. On my system it would be:

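from os.path import expanduser, join
from gensim.models.word2vec import BrownCorpus, Word2Vec

# Point at the directory holding the corpus files, not at nltk_data itself.
dirname = expanduser(join('~', 'nltk_data', 'corpora', 'brown'))
model = Word2Vec(BrownCorpus(dirname))

# Tokens keep their POS tag, so the query word includes it.
model.similar_by_word('house/nn')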

Gives:

[(u'room/nn', 0.9538693428039551), (u'door/nn', 0.9475813508033752), ...

Note that the Brown corpus files shipped with NLTK include POS tags. Gensim's BrownCorpus class ignores non-alphabetic tokens but otherwise keeps the tags, which is why the query word above is 'house/nn'. With nltk.corpus.brown.sents() you get the sentences without POS tags.
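
To illustrate why the directory matters, here is a simplified, standard-library-only stand-in for a reader of this kind (Python 3). It is a sketch under assumptions, not gensim's actual BrownCorpus code: the function names, the fake directory layout, and the sample sentence are all made up for the example. The reader only looks at files directly inside the given directory, so pointing it at the top-level directory yields nothing:

```python
import os
import tempfile


def iter_tagged_sentences(dirname):
    """Yield lists of 'word/tag' tokens from files directly inside `dirname`.

    Simplified stand-in for illustration; not gensim's BrownCorpus implementation.
    """
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if not os.path.isfile(path):
            continue  # subdirectories are skipped, not descended into
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Keep only tokens of the form WORD/TAG, lowercased.
                tokens = [t.lower() for t in line.split() if t.count("/") == 1]
                if tokens:
                    yield tokens


def strip_tag(token):
    """Drop the POS tag from a 'word/tag' token."""
    return token.rsplit("/", 1)[0]


# Build a tiny fake layout: <root>/corpora/brown/ca01 with one made-up sentence.
with tempfile.TemporaryDirectory() as root:
    brown_dir = os.path.join(root, "corpora", "brown")
    os.makedirs(brown_dir)
    with open(os.path.join(brown_dir, "ca01"), "w", encoding="utf-8") as handle:
        handle.write("The/at house/nn stood/vbd\n")

    from_root = list(iter_tagged_sentences(root))        # top-level directory
    from_brown = list(iter_tagged_sentences(brown_dir))  # directory with the files

print(from_root)
print(from_brown)
print([strip_tag(token) for token in from_brown[0]])
```

The last line shows one way to recover plain words from tagged tokens if the tags are not wanted.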

Finn Årup Nielsen answered Jan 01 '23 23:01