
Gensim train word2vec on wikipedia - preprocessing and parameters

I am trying to train the word2vec model from gensim on the Italian Wikipedia dump: "http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"

However, I am not sure what the best preprocessing for this corpus is.

The gensim model accepts a list of tokenized sentences. My first try was to just use the standard WikiCorpus preprocessor from gensim. It extracts each article, removes punctuation and splits words on spaces. With this tool, each "sentence" corresponds to an entire article, and I am not sure what impact this has on the model.

After this, I train the model with the default parameters. Unfortunately, after training it seems that I do not obtain very meaningful similarities.

What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please point me to a relevant tutorial or article.)

This is the code of my first attempt:

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# dictionary=False skips building a gensim Dictionary, which is not needed here
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
max_sentence = -1  # -1 means "use all articles"

def generate_lines():
    # yield each article as one long list of tokens
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence == -1:
            yield text
        else:
            break

model = Word2Vec()
model.build_vocab(generate_lines())  # this strangely builds a vocab of "only" 747904 words, which is << the 10M words reported in the literature
model.train(generate_lines(), chunksize=500)
Luca Fiaschi asked May 19 '14


2 Answers

Your approach is fine.

model.build_vocab(generate_lines())  # this strangely builds a vocab of "only" 747904 words, which is << the 10M words reported in the literature

This could be because of pruning infrequent words (the default is min_count=5).
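A minimal sketch of how you could check that, reusing the generate_lines() generator from the question (the min_count value here is just illustrative, and the vocabulary attribute name assumes a recent gensim release):

from gensim.models.word2vec import Word2Vec

# default is min_count=5; lowering it keeps rarer words in the vocabulary
model = Word2Vec(min_count=2)
model.build_vocab(generate_lines())

# vocabulary size after pruning (key_to_index assumes gensim >= 4; older
# versions expose the vocabulary under a different attribute)
print(len(model.wv.key_to_index))

Keep in mind that lowering min_count grows the vocabulary at the cost of noisier vectors for the rare words.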

To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (= one document) per line, and then simply using a word2vec.LineSentence corpus. This saves re-parsing the bzipped wiki XML on every iteration.
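A rough sketch of that caching step (the output filename itwiki-articles.txt.gz is made up, and the code assumes a recent gensim where get_texts() yields unicode tokens):

import gzip
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec, LineSentence

# one-off pass: stream the bzipped XML dump and write one article per line
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2', dictionary=False)
with gzip.open('itwiki-articles.txt.gz', 'wt', encoding='utf-8') as out:
    for tokens in corpus.get_texts():
        out.write(' '.join(tokens) + '\n')

# later runs: iterate over the cached plain text instead of re-parsing the XML
sentences = LineSentence('itwiki-articles.txt.gz')
model = Word2Vec(sentences, min_count=5)

LineSentence streams the gzipped file line by line, so you never have to hold the whole corpus in memory.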

Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.

Radim answered Oct 27 '22


I've been working on a project to massage the Wikipedia corpus and get vectors out of it. I might generate the Italian vectors soon, but in case you want to do it on your own, take a look at: https://github.com/idio/wiki2vec

David Przybilla answered Oct 27 '22