Using a Word2Vec model pre-trained on Wikipedia

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use gensim to create the vectors?

Asked Jul 25 '17 by Boris


People also ask

Is Word2Vec a pre-trained model?

Word2Vec is one of the most popular pretrained word embedding models. It was developed by Google and trained on the Google News dataset (about 100 billion words).

How do you train a model using Word2Vec?

Training the network:

  • We take a training sample and generate the output value of the network.
  • We evaluate the loss by comparing the model prediction with the true output label.
  • We update the weights of the network using gradient descent on the evaluated loss.
  • We then take another sample and start over again.
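
In practice, gensim runs this loop for you. Here is a minimal training sketch; the corpus path and hyperparameters are illustrative, not prescriptive:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams a plain-text file with one whitespace-tokenized sentence per line
sentences = LineSentence("corpus.txt")   # hypothetical corpus file

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimensionality ("size" in gensim < 4.0)
    window=5,         # context words on each side of the target
    min_count=5,      # ignore words rarer than this
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
    epochs=5,         # passes over the corpus ("iter" in gensim < 4.0)
    workers=4,        # training threads
)

model.save("my_word2vec.model")
print(model.wv.most_similar("vacation"))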

How are Word2Vec embeddings trained?

Word2vec uses logistic regression to train a (log-linear) classifier that distinguishes between positive and negative (true and false) examples. The trained regression weights are used as the word embeddings.
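
To make that concrete, here is a deliberately tiny, self-contained toy (not gensim's actual implementation): one logistic-regression update for a true (target, context) pair plus a negative sample, where the learned weight rows are the embeddings. The vocabulary and values are made up.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "banana"]
dim, lr = 8, 0.05
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # target-word embeddings (the output of training)
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, negatives):
    """One SGD step: pull the true (target, context) pair together, push negatives apart."""
    v = W_in[target].copy()
    v_grad = np.zeros(dim)
    for ctx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        pred = sigmoid(v @ W_out[ctx])   # logistic classifier: is this a true co-occurrence?
        err = pred - label               # gradient of the log loss w.r.t. the score
        v_grad += err * W_out[ctx]
        W_out[ctx] -= lr * err * v
    W_in[target] -= lr * v_grad

# pretend "king" and "queen" co-occurred; "banana" was drawn as a negative sample
sgns_step(vocab.index("king"), vocab.index("queen"), [vocab.index("banana")])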

How long does it take to train a Word2Vec model?

Training a Word2Vec model takes about 22 hours, and a FastText model about 33 hours. If that is too long for you, reduce the number of training iterations (the "iter" parameter, renamed "epochs" in newer gensim), but performance might be worse.


1 Answer

You can check WebVectors to find Word2Vec models trained on various corpora. Each model comes with a readme covering the training details. You'll have to be a bit careful using these models, though. I'm not sure about all of them, but at least in the Wikipedia case the model is not a binary file that you can straightforwardly load with e.g. gensim's functionality, but a text file, i.e. a file listing words and their corresponding vectors.

Keep in mind that the words are suffixed with their part-of-speech (POS) tags. For example, if you'd like to use the model to find similarities for the word vacation, you'll get a KeyError if you type vacation as is, because the model stores this word as vacation_NOUN.

An example snippet of how you could use the wiki model (and perhaps others in the same format) is below

import gensim.models

# path to the plain-text wiki model downloaded from WebVectors
model_path = "./WebVectors/3/enwiki_5_ner.txt"

# binary=False because this model is a text file, not a binary word2vec dump
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

# note the POS suffixes on the query words
print(word_vectors.most_similar("vacation_NOUN"))
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))

and the output:

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]
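
If you want to query with plain words, a small wrapper (the helper name below is my own, not part of gensim or WebVectors) can append the POS tag for you and give a clearer error when a tagged form is missing:

def most_similar_pos(word_vectors, word, pos="NOUN", topn=10):
    """Append a POS tag before querying, e.g. 'vacation' -> 'vacation_NOUN'."""
    tagged = f"{word}_{pos}"
    try:
        return word_vectors.most_similar(tagged, topn=topn)
    except KeyError:
        raise KeyError(f"'{tagged}' not in vocabulary; try another tag (VERB, ADJ, ...) "
                       "or check the model's readme for its tag set") from None

print(most_similar_pos(word_vectors, "vacation"))            # same as "vacation_NOUN"
print(most_similar_pos(word_vectors, "holiday", pos="VERB"))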

UPDATE: Here are some useful links to downloadable pretrained models:

Pretrained word embedding models:

Fasttext models (a loading sketch follows this list):

  • crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).
  • wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
  • wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
  • Wiki word vectors, dim=300: wiki.en.zip: bin+text model
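
After unzipping, the .vec files are plain word2vec-format text and load with the same gensim call as in the snippet above; the Facebook .bin inside wiki.en.zip needs gensim's fastText loader, which keeps the subword information. A sketch, assuming the archives above have been extracted into the working directory:

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# .vec files are word2vec text format (header line, then one vector per word)
crawl_vecs = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)

# the .bin keeps fastText's subword information, so rare or unseen words still get vectors
wiki_vecs = load_facebook_vectors("wiki.en.bin")

print(crawl_vecs.most_similar("vacation"))
print(wiki_vecs["vacationing"])   # composed from character n-grams if the exact form is missing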

Google Word2Vec (a loading sketch follows this list)

  • Pretrained word/phrase vectors:
    • GoogleNews-vectors-negative300.bin.gz
    • GoogleNews-vectors-negative300-SLIM.bin.gz: slim version with approx. 300k words
  • Pretrained entity vectors:
    • freebase-vectors-skipgram1000.bin.gz: Entity vectors trained on 100B words from various news articles
    • freebase-vectors-skipgram1000-en.bin.gz: Entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency
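
The GoogleNews files are binary word2vec dumps, so they load with binary=True. A sketch; the limit argument is optional but useful, since the full model holds roughly 3M words and phrases and needs several GB of RAM:

from gensim.models import KeyedVectors

# gensim reads .gz archives directly; binary=True because this is a binary dump
news_vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=500000)

print(news_vecs.most_similar("vacation"))
print(news_vecs.most_similar(positive=["woman", "king"], negative=["man"]))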

GloVe: Global Vectors for Word Representation (a loading sketch follows this list)

  • glove.6B.zip: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download). Here's an example in action.
  • glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
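
GloVe files are not quite word2vec format: they lack the header line with the vocabulary size and dimensionality. In gensim >= 4.0 you can pass no_header=True; on older versions, convert the file once with gensim's glove2word2vec script. A sketch, assuming glove.6B.zip has been extracted:

from gensim.models import KeyedVectors

# gensim >= 4.0: load the headerless GloVe text file directly
glove_vecs = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False, no_header=True)

# older gensim: convert first, then load the converted file as usual
# from gensim.scripts.glove2word2vec import glove2word2vec
# glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.w2v.txt")

print(glove_vecs.most_similar("vacation"))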

WebVectors

  • models trained on various corpora, augmented by Part-of-Speech (POS) tags
Answered Sep 27 '22 by formi23