Using a Word2Vec model pre-trained on Wikipedia

I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use gensim to create the vectors?

Asked Jul 25 '17 by Boris


People also ask

Is Word2Vec a pre-trained model?

Word2Vec is one of the most popular pretrained word embedding models. It was developed by Google and trained on the Google News dataset (about 100 billion words).

How do you train a model using Word2Vec?

Training the network:

  • We take a training sample and generate the output value of the network.
  • We evaluate the loss by comparing the model prediction with the true output label.
  • We update the weights of the network using gradient descent on the evaluated loss.
  • We then take another sample and start over again.
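
In practice, gensim runs this loop for you. Here is a minimal training sketch; the corpus path and hyperparameters are illustrative, not prescriptive:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams a plain-text file with one whitespace-tokenized sentence per line
sentences = LineSentence("corpus.txt")   # hypothetical corpus file

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimensionality ("size" in gensim < 4.0)
    window=5,         # context words on each side of the target
    min_count=5,      # ignore words rarer than this
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
    epochs=5,         # passes over the corpus ("iter" in gensim < 4.0)
    workers=4,        # training threads
)

model.save("my_word2vec.model")
print(model.wv.most_similar("vacation"))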

How are Word2Vec embeddings trained?

Word2vec uses logistic regression to train a (log-linear) classifier that distinguishes between positive and negative (true and false) examples. The trained regression weights are used as the word embeddings.
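
To make that concrete, here is a deliberately tiny, self-contained toy (not gensim's actual implementation): one logistic-regression update for a true (target, context) pair plus a negative sample, where the learned weight rows are the embeddings. The vocabulary and values are made up.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "banana"]
dim, lr = 8, 0.05
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # target-word embeddings (the output of training)
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, negatives):
    """One SGD step: pull the true (target, context) pair together, push negatives apart."""
    v = W_in[target].copy()
    v_grad = np.zeros(dim)
    for ctx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        pred = sigmoid(v @ W_out[ctx])   # logistic classifier: is this a true co-occurrence?
        err = pred - label               # gradient of the log loss w.r.t. the score
        v_grad += err * W_out[ctx]
        W_out[ctx] -= lr * err * v
    W_in[target] -= lr * v_grad

# pretend "king" and "queen" co-occurred; "banana" was drawn as a negative sample
sgns_step(vocab.index("king"), vocab.index("queen"), [vocab.index("banana")])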

How long does it take to train a Word2Vec model?

Training a Word2Vec model takes about 22 hours, and a FastText model about 33 hours. If that is too long for you, reduce the number of training iterations (the "iter" parameter, renamed "epochs" in newer gensim), but performance might be worse.


1 Answer

You can check WebVectors to find Word2Vec models trained on various corpora. Each model comes with a readme covering the training details. You'll have to be a bit careful using these models, though. I'm not sure about all of them, but at least in the Wikipedia case the model is not a binary file that you can straightforwardly load with e.g. gensim's functionality, but a text file, i.e. a file listing words and their corresponding vectors.

Keep in mind that the words are suffixed with their part-of-speech (POS) tags. For example, if you'd like to use the model to find similarities for the word vacation, you'll get a KeyError if you type vacation as is, because the model stores this word as vacation_NOUN.

An example snippet of how you could use the wiki model (and perhaps others in the same format) is below

import gensim.models

# path to the plain-text wiki model downloaded from WebVectors
model_path = "./WebVectors/3/enwiki_5_ner.txt"

# binary=False because this model is a text file, not a binary word2vec dump
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

# note the POS suffixes on the query words
print(word_vectors.most_similar("vacation_NOUN"))
print(word_vectors.most_similar(positive=['woman_NOUN', 'king_NOUN'], negative=['man_NOUN']))

and the output:

▶ python3 wiki_model.py
[('vacation_VERB', 0.6829521656036377), ('honeymoon_NOUN', 0.6811978816986084), ('holiday_NOUN', 0.6588436365127563), ('vacationer_NOUN', 0.6212040781974792), ('resort_NOUN', 0.5720850825309753), ('trip_NOUN', 0.5585346817970276), ('holiday_VERB', 0.5482848882675171), ('week-end_NOUN', 0.5174300670623779), ('newlywed_NOUN', 0.5146450996398926), ('honeymoon_VERB', 0.5135983228683472)]
[('monarch_NOUN', 0.6679952144622803), ('ruler_NOUN', 0.6257176995277405), ('regnant_NOUN', 0.6217397451400757), ('royal_ADJ', 0.6212111115455627), ('princess_NOUN', 0.6133661866188049), ('queen_NOUN', 0.6015778183937073), ('kingship_NOUN', 0.5986001491546631), ('prince_NOUN', 0.5900266170501709), ('royal_NOUN', 0.5886058807373047), ('throne_NOUN', 0.5855424404144287)]
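
If you want to query with plain words, a small wrapper (the helper name below is my own, not part of gensim or WebVectors) can append the POS tag for you and give a clearer error when a tagged form is missing:

def most_similar_pos(word_vectors, word, pos="NOUN", topn=10):
    """Append a POS tag before querying, e.g. 'vacation' -> 'vacation_NOUN'."""
    tagged = f"{word}_{pos}"
    try:
        return word_vectors.most_similar(tagged, topn=topn)
    except KeyError:
        raise KeyError(f"'{tagged}' not in vocabulary; try another tag (VERB, ADJ, ...) "
                       "or check the model's readme for its tag set") from None

print(most_similar_pos(word_vectors, "vacation"))            # same as "vacation_NOUN"
print(most_similar_pos(word_vectors, "holiday", pos="VERB"))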

UPDATE: Here are some useful links to downloadable pretrained models:

Pretrained word embedding models:

Fasttext models (a loading sketch follows this list):

  • crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).
  • wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
  • wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
  • Wiki word vectors, dim=300: wiki.en.zip: bin+text model
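
After unzipping, the .vec files are plain word2vec-format text and load with the same gensim call as in the snippet above; the Facebook .bin inside wiki.en.zip needs gensim's fastText loader, which keeps the subword information. A sketch, assuming the archives above have been extracted into the working directory:

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# .vec files are word2vec text format (header line, then one vector per word)
crawl_vecs = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)

# the .bin keeps fastText's subword information, so rare or unseen words still get vectors
wiki_vecs = load_facebook_vectors("wiki.en.bin")

print(crawl_vecs.most_similar("vacation"))
print(wiki_vecs["vacationing"])   # composed from character n-grams if the exact form is missing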

Google Word2Vec (a loading sketch follows this list)

  • Pretrained word/phrase vectors:
    • GoogleNews-vectors-negative300.bin.gz
    • GoogleNews-vectors-negative300-SLIM.bin.gz: slim version with approx. 300k words
  • Pretrained entity vectors:
    • freebase-vectors-skipgram1000.bin.gz: Entity vectors trained on 100B words from various news articles
    • freebase-vectors-skipgram1000-en.bin.gz: Entity vectors trained on 100B words from various news articles, using the deprecated /en/ naming (more easily readable); the vectors are sorted by frequency
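
The GoogleNews files are binary word2vec dumps, so they load with binary=True. A sketch; the limit argument is optional but useful, since the full model holds roughly 3M words and phrases and needs several GB of RAM:

from gensim.models import KeyedVectors

# gensim reads .gz archives directly; binary=True because this is a binary dump
news_vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True, limit=500000)

print(news_vecs.most_similar("vacation"))
print(news_vecs.most_similar(positive=["woman", "king"], negative=["man"]))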

GloVe: Global Vectors for Word Representation (a loading sketch follows this list)

  • glove.6B.zip: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d, 100d, 200d, & 300d vectors, 822 MB download). Here's an example in action.
  • glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
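
GloVe files are not quite word2vec format: they lack the header line with the vocabulary size and dimensionality. In gensim >= 4.0 you can pass no_header=True; on older versions, convert the file once with gensim's glove2word2vec script. A sketch, assuming glove.6B.zip has been extracted:

from gensim.models import KeyedVectors

# gensim >= 4.0: load the headerless GloVe text file directly
glove_vecs = KeyedVectors.load_word2vec_format("glove.6B.300d.txt", binary=False, no_header=True)

# older gensim: convert first, then load the converted file as usual
# from gensim.scripts.glove2word2vec import glove2word2vec
# glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.w2v.txt")

print(glove_vecs.most_similar("vacation"))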

WebVectors

  • models trained on various corpora, augmented by Part-of-Speech (POS) tags
Answered Sep 27 '22 by formi23