Python: gensim: RuntimeError: you must first build vocabulary before training the model

Question

I know this question has been asked before, but I still could not find a solution.

I would like to use gensim's word2vec on a custom data set, but I am still figuring out what format the data set has to be in. I had a look at this post, where the input is basically a list of lists (one big list containing other lists, each a tokenized sentence from the NLTK Brown corpus). So I assumed this is the input format for word2vec.Word2Vec(). However, it does not work with my small test set, and I don't understand why.

What I have tried:

This worked:

from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

brown_vecs = word2vec.Word2Vec(brown.sents())

This didn't work:

sentences = [ "the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)

I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:

vocab:

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

brown.sents(): (the beginning of it)

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Can anyone please tell me what I'm doing wrong?

kampta · Accepted Answer

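The default min_count in gensim's Word2Vec is 5, so every word that occurs fewer than 5 times is dropped. If no word in your corpus occurs at least 5 times, the vocabulary ends up empty, hence the error. In your test data the most frequent word, 'the', occurs only twice. Try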

voc_vec = word2vec.Word2Vec(vocab, min_count=1)
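
To see why the mock data leaves nothing to train on, here is a minimal pure-Python sketch of the frequency filter. It does not call gensim; surviving_vocab is a hypothetical helper that only mimics the idea of min_count (keep words that occur at least that many times), under the assumption that this is the filter responsible for the empty vocabulary.

```python
from collections import Counter


def surviving_vocab(sentences, min_count=5):
    """Return the words that occur at least min_count times across all sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return sorted(word for word, n in counts.items() if n >= min_count)


sentences = [
    "the quick brown fox jumps over the lazy dogs",
    "yoyoyo you go home now to sleep",
]
vocab = [s.split() for s in sentences]

default_vocab = surviving_vocab(vocab)               # mimics the default min_count=5
relaxed_vocab = surviving_vocab(vocab, min_count=1)  # mimics min_count=1

print(default_vocab)       # [] : no word occurs 5 times, so nothing is left
print(len(relaxed_vocab))  # 15 distinct words survive
```

With a corpus this small, only min_count=1 keeps every word; at min_count=2 the only survivor would be 'the'.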

Akson · Answer

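Input to gensim's Word2Vec is expected to be a list of tokenized sentences, i.e. a list of lists of words. Other shapes are not tokenized for you, so they are not read the way you might expect.

E.g.

1. sentences = ['I love ice-cream', 'he loves ice-cream', 'you love ice cream'] (each string is iterated character by character)
2. words = ['i','love','ice - cream', 'like', 'ice-cream'] (same problem: each string is treated as a sentence made of characters)
3. sentences = [['i love ice-cream'], ['he loves ice-cream'], ['you love ice cream']] (each whole sentence becomes a single token)
4. sentences = [['i', 'love', 'ice-cream'], ['he', 'loves', 'ice-cream'], ['you', 'love', 'ice', 'cream']] (tokenized; this is the intended shape)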
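
How differently these shapes are read can be checked without gensim. The sketch below uses a hypothetical helper, tokens_seen, that iterates every corpus item as if it were a sentence of tokens. That iteration pattern is an assumption about how a list-of-token-lists consumer such as Word2Vec walks its input, not gensim's actual code.

```python
def tokens_seen(corpus):
    """Collect the distinct tokens found when each corpus item is iterated as a sentence."""
    return sorted({token for sentence in corpus for token in sentence})


raw_strings = ['he loves ice-cream']          # list of plain strings
wrapped = [['he loves ice-cream']]            # each sentence wrapped in a one-element list
tokenized = [s.split() for s in raw_strings]  # list of lists of words

print(tokens_seen(raw_strings))  # single characters, including ' ' and '-'
print(tokens_seen(wrapped))      # ['he loves ice-cream']
print(tokens_seen(tokenized))    # ['he', 'ice-cream', 'loves']
```

Only the tokenized form yields word-level tokens, so splitting each sentence (with str.split() or a proper tokenizer) before building the vocabulary is the safe default.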

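Build the vocabulary before training. If you create the model without passing it a corpus, call build_vocab yourself first (min_count=1 is used here so that a tiny corpus still yields a vocabulary):

model = word2vec.Word2Vec(min_count=1)
model.build_vocab(sentences, update=False)

After that you can train with model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs). Check out the link for detailed info.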

Tags: python, gensim, word2vec

user56591
