How to delete words in self-trained word2vec model

I have a self-trained word2vec model (2 GB, ending in ".model"). I converted the model into a text file (over 50 GB, ending in ".txt") because I need to use the text file in my other Python code. I am trying to reduce the size of the text file by deleting words that I do not need. I have built a vocabulary set containing all the words I need. How can I filter the unnecessary words out of the model?

I have tried to build a dictionary from the text file, but I run out of RAM:

emb_dict = dict()
with open(emb_path, "r", encoding="utf-8") as f:
    lines = f.readlines()  # loads the entire 50 GB file into memory at once
    for l in lines:
        word, embedding = l.strip().split(' ', 1)
        emb_dict[word] = embedding

I am wondering whether I can instead delete words directly in the ".model" file. How can I do that? Any help would be appreciated!

asked Oct 19 '25 by Sirui Li


2 Answers

It's hard to answer further without more precise code, but you could stream through the text file line by line and write the lines you want to keep out in batches, so the whole file never has to fit in RAM:

lines_to_keep = []
new_file = "some_path.txt"
words_to_keep = set(some_words)
with open(emb_path, "r", encoding="utf-8") as f:
    for l in f:  # iterate line by line instead of reading the whole file
        word, embedding = l.strip().split(' ', 1)
        if word in words_to_keep:
            lines_to_keep.append(l.strip())
        if lines_to_keep and len(lines_to_keep) % 1000 == 0:
            # append the current batch to the output file
            with open(new_file, "a", encoding="utf-8") as out:
                out.write("\n".join(lines_to_keep) + "\n")
            lines_to_keep = []

# write out whatever is left after the last full batch
if lines_to_keep:
    with open(new_file, "a", encoding="utf-8") as out:
        out.write("\n".join(lines_to_keep) + "\n")
answered Oct 21 '25 by ted


Usually the best way to keep a word2vec model size down is to discard more of the less-frequent words that appeared in the original training corpus.

Words with only a few mentions tend to not get very good word-vectors anyway, and throwing out lots of the few-occurrence words usually has the beneficial side-effect of making the remaining word-vectors better.

If you're using the gensim Word2Vec class, two alternative ways to do this before training are:

  • Use a larger min_count value.
  • Specify a max_final_vocab count: no more than that many words will be kept by the model. (See the sketch after this list.)
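
A minimal sketch of both options, assuming sentences is your training corpus (an iterable of token lists) and the numeric values are just placeholders:

from gensim.models import Word2Vec

model = Word2Vec(
    sentences,               # training corpus: an iterable of token lists (assumed name)
    min_count=10,            # discard words appearing fewer than 10 times
    max_final_vocab=500000,  # and/or keep no more than this many surviving words
)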

After training, with a set of vectors that were already saved with .save_word2vec_format(), you could re-load them using the limit parameter (to only load the leading, most-frequent words), then re-save. For example:

from gensim.models import KeyedVectors
w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False, limit=500000)
w2v_model.save_word2vec_format(somevecs_filename, binary=False)

Alternatively, if you had a list_of_words_to_keep, you could load the full file (no limit, assuming you have enough RAM), then thin out the model's .vocab dictionary before re-saving. For example:

from gensim.models import KeyedVectors
w2v_model = KeyedVectors.load_word2vec_format(allvecs_filename, binary=False)
vocab_set = set(w2v_model.vocab.keys())
keep_set = set(list_of_words_to_keep)
drop_set = vocab_set - keep_set  # every word not on the keep-list
for word in drop_set:
    del w2v_model.vocab[word]  # dropped entries won't be written out
w2v_model.save_word2vec_format(somevecs_filename, binary=False)
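
Note that (in gensim versions where KeyedVectors exposes a .vocab dict) this only shrinks what gets written to disk: the in-memory vectors array still holds every original row, so memory usage doesn't drop until you re-load the smaller saved file.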
answered Oct 21 '25 by gojomo