
Can't load glove.6B.300d.txt

Tags:

word2vec

I'm trying to load GloVe vectors with the following code:

en_model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=False)

and I unexpectedly get the following error:

 File "/home/k/Desktop/Work/Vector explorer/word2vec-explorer/vec_test_loader.py", line 55, in make_model
en_model = KeyedVectors.load_word2vec_format(model_path, binary=is_bin)
 File "/home/k/.local/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 1119, in load_word2vec_format
limit=limit, datatype=datatype)
 File "/home/k/.local/lib/python3.5/site-packages/gensim/models/utils_any2vec.py", line 175, in _load_word2vec_format
vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
 File "/home/k/.local/lib/python3.5/site-packages/gensim/models/utils_any2vec.py", line 175, in <genexpr>
vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format

ValueError: invalid literal for int() with base 10: 'the'

Can someone help?

Kalpit asked Dec 04 '25 23:12


1 Answer

Gensim expects the file at model_path to start with a header line containing two numbers: the first is the number of words in the vocabulary, and the second is the dimension of the word embeddings. GloVe files do not have this header, so gensim tries to parse the first token of the file ('the') as an integer and fails. The file should look like this:

101 300
the 1.0 2.1 -1.3 ...
I   1.1 0.2 -0.3 ...
.
.
.
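
To confirm that a missing header is really the problem before converting anything, you can look at the first line of the file. The following is only a minimal sketch: the function name has_word2vec_header is made up for illustration and is not part of gensim, and it assumes a UTF-8 text file.

```python
def has_word2vec_header(path, encoding="utf-8"):
    """Return True if the first line looks like '<vocab_size> <vector_size>'.

    Hypothetical helper, not part of gensim. A raw GloVe file starts with a
    word followed by floats, so its first line fails this check.
    """
    with open(path, "r", encoding=encoding) as f:
        fields = f.readline().split()
    # A word2vec-style header has exactly two fields, both integers.
    if len(fields) != 2:
        return False
    try:
        int(fields[0])
        int(fields[1])
    except ValueError:
        return False
    return True
```

If this returns False for your file, the header still has to be added before load_word2vec_format can read it.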

You can convert the file with a single command:

python -m gensim.scripts.glove2word2vec --input glove.840B.300d.txt --output glove.840B.300d.w2vformat.txt

Alternatively, you can use my code below as a reference:

import shutil
from sys import platform

import gensim


def getFileLineNums(filename):
    # Count the lines in the file, i.e. the number of word vectors.
    with open(filename, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


def prepend_line(infile, outfile, line):
    # Write the header line, then copy the rest of the file in bulk.
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    # Same result as prepend_line, but copies line by line.
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(line + "\n")
            for row in fin:
                fout.write(row)


def load(filename, vector_size=300):
    # vector_size must match the file, e.g. 300 for glove.6B.300d.txt.
    num_lines = getFileLineNums(filename)
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, vector_size)
    # Prepend the header line.
    if platform == "linux" or platform == "linux2":
        prepend_line(filename, gensim_file, gensim_first_line)
    else:
        prepend_slow(filename, gensim_file, gensim_first_line)

    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file)
    return model


# Replace your_model_path with the path to your GloVe file.
model = load(your_model_path)
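
The code above assumes 300-dimensional vectors. If you want the same idea to work for other GloVe files, the dimension can be read off the first line instead. This is a rough sketch under a few assumptions: the name add_word2vec_header is made up for illustration, the file is UTF-8 with no blank lines, and the first token contains no spaces, so the dimension is the number of fields on the first line minus one.

```python
import shutil


def add_word2vec_header(glove_path, out_path, encoding="utf-8"):
    """Copy glove_path to out_path with a '<vocab_size> <vector_size>' header.

    Hypothetical helper, not part of gensim. Returns (vocab_size, vector_size).
    """
    vocab_size = 0
    vector_size = None
    # First pass: count the lines and infer the dimension from the first one.
    with open(glove_path, "r", encoding=encoding) as src:
        for line in src:
            if vector_size is None:
                # Assumption: one token followed by the vector components.
                vector_size = len(line.split()) - 1
            vocab_size += 1
    if vector_size is None:
        raise ValueError("input file is empty: {}".format(glove_path))
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, "r", encoding=encoding) as src:
        with open(out_path, "w", encoding=encoding) as dst:
            dst.write("{} {}\n".format(vocab_size, vector_size))
            shutil.copyfileobj(src, dst)
    return vocab_size, vector_size
```

The output file can then be passed to gensim.models.KeyedVectors.load_word2vec_format(out_path, binary=False) as in the question. As far as I know, gensim 4.0 and later also accept no_header=True in load_word2vec_format, which skips the conversion step entirely, but check the documentation for your installed version.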
Joe Zhow answered Dec 07 '25 21:12


