Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: gensim: RuntimeError: you must first build vocabulary before training the model

I know that this question has been asked already, but I was still not able to find a solution for it.

I would like to use gensim's word2vec on a custom data set, but now I'm still figuring out in what format the dataset has to be. I had a look at this post where the input is basically a list of lists (one big list containing other lists that are tokenized sentences from the NLTK Brown corpus). So I thought that this is the input format I have to use for the command word2vec.Word2Vec(). However, it won't work with my little test set and I don't understand why.

What I have tried:

This worked:

from gensim.models import word2vec
from nltk.corpus import brown
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

brown_vecs = word2vec.Word2Vec(brown.sents())

This didn't work:

sentences = [ "the quick brown fox jumps over the lazy dogs","yoyoyo you go home now to sleep"]
vocab = [s.encode('utf-8').split() for s in sentences]
voc_vec = word2vec.Word2Vec(vocab)

I don't understand why it doesn't work with the "mock" data, even though it has the same data structure as the sentences from the Brown corpus:

vocab:

[['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dogs'], ['yoyoyo', 'you', 'go', 'home', 'now', 'to', 'sleep']]

brown.sents(): (the beginning of it)

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

Can anyone please tell me what I'm doing wrong?

like image 335
user56591 Avatar asked Nov 30 '15 00:11

user56591


Video Answer


2 Answers

Default min_count in gensim's Word2Vec is set to 5. If there is no word in your vocab with frequency greater than 4, your vocab will be empty and hence the error. Try

voc_vec = word2vec.Word2Vec(vocab, min_count=1)
like image 73
kampta Avatar answered Oct 20 '22 15:10

kampta


Input to the gensim's Word2Vec can be a list of sentences or list of words or list of list of sentences.

E.g.

1. sentences = ['I love ice-cream', 'he loves ice-cream', 'you love ice cream']
2. words = ['i','love','ice - cream', 'like', 'ice-cream']
3. sentences = [['i love ice-cream'], ['he loves ice-cream'], ['you love ice cream']]

build the vocab before training

model.build_vocab(sentences, update=False)

just check out the link for detailed info

like image 4
Akson Avatar answered Oct 20 '22 15:10

Akson