
Gensim train word2vec on wikipedia - preprocessing and parameters

I am trying to train the word2vec model from gensim on the Italian Wikipedia dump: "http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"

However, I am not sure what the best preprocessing for this corpus is.

The gensim model accepts a list of tokenized sentences. My first try was to just use the standard WikiCorpus preprocessor from gensim. It extracts each article, removes punctuation and splits words on spaces. With this tool, each "sentence" corresponds to an entire article, and I am not sure what impact this has on the model.

After this, I train the model with the default parameters. Unfortunately, after training it seems that I do not obtain very meaningful similarities.

What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please point me to a relevant tutorial or article.)

This is the code of my first attempt:

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# dictionary=False skips building a gensim Dictionary, which is not needed here
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
max_sentence = -1  # -1 means "use all articles"

def generate_lines():
    # yield each article as one long list of tokens
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break

model = Word2Vec()
model.build_vocab(generate_lines())  # this strangely builds a vocab of "only" 747904 words, which is << the 10M words reported in the literature
model.train(generate_lines(), chunksize=500)
Luca Fiaschi asked May 19 '14


2 Answers

Your approach is fine.

model.build_vocab(generate_lines())  # this strangely builds a vocab of "only" 747904 words, which is << the 10M words reported in the literature

This could be because of pruning infrequent words (the default is min_count=5).
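A minimal sketch of how you could check that, reusing the generate_lines() generator from the question (the min_count value here is just illustrative, and the vocabulary attribute name assumes a recent gensim release):

from gensim.models.word2vec import Word2Vec

# default is min_count=5; lowering it keeps rarer words in the vocabulary
model = Word2Vec(min_count=2)
model.build_vocab(generate_lines())

# vocabulary size after pruning (key_to_index assumes gensim >= 4; older
# versions expose the vocabulary under a different attribute)
print(len(model.wv.key_to_index))

Keep in mind that lowering min_count grows the vocabulary at the cost of noisier vectors for the rare words.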

To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (= one document) per line, and then simply using a word2vec.LineSentence corpus. This saves re-parsing the bzipped wiki XML on every iteration.
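A rough sketch of that caching step (the output filename itwiki-articles.txt.gz is made up, and the code assumes a recent gensim where get_texts() yields unicode tokens):

import gzip
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec, LineSentence

# one-off pass: stream the bzipped XML dump and write one article per line
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-articles.txt.gz', 'wt', encoding='utf-8') as out:
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# later runs: iterate over the cached plain text instead of re-parsing the XML
sentences = LineSentence('itwiki-articles.txt.gz')
model = Word2Vec(sentences, min_count=5)

LineSentence streams the gzipped file line by line, so you never have to hold the whole corpus in memory.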

Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.

Radim answered Oct 27 '22


I've been working on a project to massage the Wikipedia corpus and get vectors out of it. I might generate the Italian vectors soon, but in case you want to do it on your own, take a look at: https://github.com/idio/wiki2vec

David Przybilla answered Oct 27 '22