We are trying to implement a word vector model for the set of words given below.
stemmed = ['data', 'appli', 'scientist', 'mgr', 'microsoft', 'hire', 'develop', 'mentor', 'team', 'data', 'scientist', 'defin', 'data', 'scienc', 'prioriti', 'deep', 'understand', 'busi', 'goal', 'collabor', 'across', 'multipl', 'group', 'set', 'team', 'shortterm', 'longterm', 'goal', 'act', 'strateg', 'advisor', 'leadership', 'influenc', 'futur', 'direct', 'strategi', 'defin', 'partnership', 'align', 'efficaci', 'broad', 'analyt', 'effort', 'analyticsdata', 'team', 'drive', 'particip', 'data', 'scienc', 'bi', 'commun', 'disciplin', 'microsoftprior', 'experi', 'hire', 'manag', 'run', 'team', 'data', 'scientist', 'busi', 'domain', 'experi', 'use', 'analyt', 'must', 'experi', 'across', 'sever', 'relev', 'busi', 'domain', 'util', 'critic', 'think', 'skill', 'conceptu', 'complex', 'busi', 'problem', 'solut', 'use', 'advanc', 'analyt', 'larg', 'scale', 'realworld', 'busi', 'data', 'set', 'candid', 'must', 'abl', 'independ', 'execut', 'analyt', 'project', 'help', 'intern', 'client', 'understand']
We are using this code:
import gensim
model = gensim.models.FastText(stemmed, size=100, window=5, min_count=1, workers=4, sg=1)
model.wv.most_similar(positive=['data'])
However, we are getting this error:
KeyError: 'all ngrams for word data absent from model'
You need to provide your training data not as a list but rather as a generator.
Try:
import gensim
from gensim.models.fasttext import FastText as FT_gensim
stemmed = ['data', 'appli', 'scientist', ... ]
def gen_words(stemmed):
yield stemmed
model = FT_gensim(size=100, window=5, min_count=1, workers=4, sg=1)
model.build_vocab(gen_words(stemmed))
model.train(gen_words(stemmed), total_examples=model.corpus_count, epochs=model.iter)
model.wv.most_similar(positive=['data'])
This prints out:
[('busi', -0.043828580528497696)]
See also this notebook from the gensim documentation. And this excellent gensim tutorial on all things iterable:
In gensim, it’s up to you how you create the corpus. Gensim algorithms only care that you supply them with an iterable of sparse vectors (and for some algorithms, even a generator = a single pass over the vectors is enough).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With