I am trying to train the word2vec model from gensim on the Italian Wikipedia dump:
"http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2"
However, I am not sure what the best preprocessing for this corpus is.
The gensim model accepts a list of tokenized sentences.
My first try is to just use the standard WikiCorpus preprocessor from gensim. This extracts each article, removes punctuation and splits words on spaces. With this tool each "sentence" corresponds to an entire article, and I am not sure what impact this has on the model.
After this I train the model with default parameters. Unfortunately, after training I do not obtain very meaningful similarities.
What is the most appropriate preprocessing of the Wikipedia corpus for this task? (If this question is too broad, please help me by pointing to a relevant tutorial or article.)
This is the code of my first trial:
from gensim.corpora import WikiCorpus
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
corpus = WikiCorpus('itwiki-latest-pages-articles.xml.bz2',dictionary=False)
max_sentence = -1
def generate_lines():
    for index, text in enumerate(corpus.get_texts()):
        if index < max_sentence or max_sentence==-1:
            yield text
        else:
            break
from gensim.models.word2vec import Word2Vec
model = Word2Vec()
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
model.train(generate_lines(),chunksize=500)
Your approach is fine.
model.build_vocab(generate_lines()) #This strangely builds a vocab of "only" 747904 words which is << than those reported in the literature 10M words
This could be because infrequent words are pruned: with the default min_count=5, any word that occurs fewer than 5 times in the corpus is dropped from the vocabulary.
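To get a feel for how much a frequency cutoff shrinks the vocabulary, here is a minimal sketch using only the standard library. The helper name vocab_sizes is hypothetical and this is not gensim's own code; it only assumes that a word is kept when its count is at least min_count.

```python
from collections import Counter

def vocab_sizes(texts, min_count=5):
    """Return (distinct words, words kept after the min_count cutoff).

    `texts` is any iterable of token lists, e.g. what generate_lines() yields.
    Hypothetical helper for illustration; it only mimics the cutoff rule.
    """
    counts = Counter(token for tokens in texts for token in tokens)
    kept = sum(1 for c in counts.values() if c >= min_count)
    return len(counts), kept
```

Calling it on the generator from the question, e.g. vocab_sizes(generate_lines(), min_count=5), would show how many distinct tokens exist before pruning. In a Wikipedia dump a large share of tokens are rare (typos, names, numbers), so a big gap between the two numbers is expected.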
To speed up computation, you can consider "caching" the preprocessed articles as a plain .txt.gz file, one sentence (document) per line, and then simply using the word2vec.LineSentence corpus on that file. This saves parsing the bzipped wiki XML on every iteration.
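The caching step could be sketched roughly as follows. The function names cache_texts and iter_cached and the file name are hypothetical, and the sketch assumes the tokens are already text strings (on older Python 2 setups WikiCorpus may yield byte strings that need decoding first).

```python
import gzip

def cache_texts(texts, path):
    """Write each tokenized document as one space-separated line of a gzipped text file.

    Hypothetical helper: `texts` is any iterable of lists of str tokens.
    Returns the number of documents written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for tokens in texts:
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count

def iter_cached(path):
    """Read the cached file back, yielding one list of tokens per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.split()
```

After running cache_texts(corpus.get_texts(), 'itwiki.txt.gz') once, the resulting file can be passed to word2vec.LineSentence in place of generate_lines(); iter_cached is only there to show the file format and to check the round trip without gensim.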
Why word2vec doesn't produce "meaningful similarities" for Italian wiki, I don't know. English wiki seems to work fine. See also here.
I've been working on a project to massage the Wikipedia corpus and get vectors out of it. I might generate the Italian vectors soon, but in case you want to do it on your own, take a look at: https://github.com/idio/wiki2vec