
Disabling Gensim's removal of punctuation etc. when parsing a wiki corpus

I want to train a word2vec model on the English Wikipedia using Python with gensim. I closely followed https://groups.google.com/forum/#!topic/gensim/MJWrDw_IvXw for that.

It works for me, but what I don't like about the resulting word2vec model is that named entities are split, which makes the model unusable for my specific application. The model I need has to represent named entities as a single vector.

That's why I planned to parse the Wikipedia articles with spaCy and merge entities like "north carolina" into "north_carolina", so that word2vec would represent them as a single vector. So far so good.
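For illustration, that merging step can be sketched without spaCy itself. The helper name `merge_entities` is mine, and the spans are assumed to come from elsewhere (e.g. spaCy's `ent.start_char`/`ent.end_char`):

```python
import re

def merge_entities(text, entity_spans):
    """Replace each (start, end) character span in `text` with the same
    substring, but with internal whitespace turned into underscores,
    so word2vec later treats it as one token."""
    # Process spans right-to-left so earlier offsets stay valid.
    for start, end in sorted(entity_spans, reverse=True):
        merged = re.sub(r"\s+", "_", text[start:end])
        text = text[:start] + merged + text[end:]
    return text

sentence = "I moved from north carolina to new york"
spans = [(13, 27), (31, 39)]  # character offsets of the two entities
print(merge_entities(sentence, spans))
# -> I moved from north_carolina to new_york
```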

The spaCy parsing has to be part of the preprocessing, which I originally did as recommended in the linked discussion using:

...
from gensim.corpora import WikiCorpus

# dictionary={} skips building a vocabulary; we only need the token stream
wiki = WikiCorpus(wiki_bz2_file, dictionary={})
for text in wiki.get_texts():
    article = " ".join(text) + "\n"
    output.write(article)
...

This removes punctuation, stop words, numbers and capitalization, and saves each article on a separate line in the output file. The problem is that spaCy's NER doesn't really work on this preprocessed text, presumably because it relies on punctuation and capitalization (?).
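To make the effect concrete, here is a rough approximation of the normalization gensim applies when extracting wiki text (a simplification I wrote for illustration, not gensim's actual tokenizer):

```python
import re

def simple_wiki_tokenize(text, min_len=2, max_len=15):
    """Rough approximation of gensim's wiki preprocessing: lowercase,
    keep only alphabetic tokens within a length range (numbers and
    punctuation are dropped entirely)."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

print(simple_wiki_tokenize("North Carolina was admitted in 1789."))
# -> ['north', 'carolina', 'was', 'admitted', 'in']
```

The capitalization and the period that an NER model would key on are gone by the time the text reaches the output file.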

Does anyone know if I can "disable" gensim's preprocessing so that it doesn't remove punctuation etc., but still parses the Wikipedia articles to text directly from the compressed dump? Or does someone know a better way to accomplish this? Thanks in advance!

asked Oct 30 '22 by marlonfl

1 Answer

I wouldn't be surprised if spaCy were operating at the level of sentences. For that it very likely uses sentence boundaries (periods, question marks, etc.). That is why spaCy's NER (or maybe even a POS tagger earlier in the pipeline) might be failing for you.

As for the way to represent named entities for gensim's word2vec - I would recommend adding an artificial identifier (a non-existent word). From the model's perspective it does not make any difference, and it may save you the burden of reworking gensim's preprocessing.
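A minimal sketch of that substitution, working on the already-tokenized text; the helper name `replace_entity` and the placeholder token are my own choices:

```python
def replace_entity(tokens, entity, identifier):
    """Replace every occurrence of the multi-token `entity` sequence
    in `tokens` with the single artificial `identifier` token."""
    out, i, n = [], 0, len(entity)
    while i < len(tokens):
        if tokens[i:i + n] == entity:
            out.append(identifier)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["i", "visited", "north", "carolina", "last", "year"]
print(replace_entity(tokens, ["north", "carolina"], "ent0"))
# -> ['i', 'visited', 'ent0', 'last', 'year']
```

word2vec only sees token identities, so "ent0" works exactly as well as "north_carolina" for training; it just isn't human-readable.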

You may want to check the model's vocabulary via model.wv.vocab, where model = gensim.models.Word2Vec(...). For that you would have to train the model twice. Alternatively, try building a vocabulary set from the raw text and pick a random string of letters that does not already exist in the vocabulary.
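The second option can be sketched like this (the function name `fresh_identifier` is mine; the vocabulary is assumed to be a plain set of strings built from the raw text):

```python
import random
import string

def fresh_identifier(vocab, length=8, seed=None):
    """Draw random lowercase strings until one is found that does not
    already appear in `vocab`, so it cannot collide with a real word."""
    rng = random.Random(seed)
    while True:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in vocab:
            return candidate

vocab = {"north", "carolina", "word", "vector"}
ident = fresh_identifier(vocab)
assert ident not in vocab and len(ident) == 8
```

With length 8 there are 26^8 possible strings, so a collision with a real vocabulary is vanishingly unlikely and the loop almost always terminates on the first draw.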

answered Nov 15 '22 by sophros