
Trouble loading the GloVe 840B 300d vectors

Each line of the file seems to have the format 'word number number ...', so it should be easy to split. But when I split the lines with the script below:

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model

When I load glove.840B.300d.txt I get an error. Printing splitLine for the failing lines gives

['contact', '[email protected]', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]

or

['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]

Note that this script works fine with the glove.6B.* files.

Linjie Xu asked Mar 03 '18 11:03



2 Answers

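The code works fine for glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with whitespace inside them. For example, it has a word like '. . .', with spaces between the dots. split() with no argument splits on any whitespace character, so such a word is broken into several tokens and the numbers no longer start at index 1. split(' ') splits only on the plain space character, which keeps the word whole as long as the whitespace inside it is some other character. I solved this problem by changing this line: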

splitLine = line.split()

into

splitLine = line.split(' ')

So your code should look like this:

import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # Split on the plain space only, so other whitespace inside a word is kept.
            splitLine = line.rstrip('\n').split(' ')
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model
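
To see why the one-character change matters, here is a minimal sketch on a made-up line. It assumes the whitespace inside the word is a non-breaking space (U+00A0); that is an assumption for illustration, not something checked against the actual file, and the three numbers are invented.

```python
# Hypothetical sample line: a three-dot "word" joined by non-breaking spaces
# (U+00A0, an assumption), followed by three made-up vector components.
line = ".\u00a0.\u00a0. 0.1 0.2 0.3\n"

# split() with no argument splits on every kind of whitespace,
# so the word is broken into three separate tokens.
any_ws = line.split()
print(any_ws)

# split(' ') splits only on the plain space, so the word stays in one piece.
space_only = line.rstrip("\n").split(" ")
print(space_only)
```

With the first form, splitLine[1:] starts with '.' tokens and float() fails on them; with the second, everything after index 0 is numeric.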
Weikai answered Sep 17 '22 16:09


I think the following may help:

import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None

    try:
        splitLine = line.split()
        # The last `dim` tokens are the vector; everything before them is the word.
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except ValueError:
        # A token in the vector part was not a number; show the offending line.
        print(line)

    return word, embedding

def load_glove_model(glove_filepath, dim):
    model = {}
    with open(glove_filepath, encoding="utf8") as f:
        # Iterate lazily instead of reading the whole file into memory first.
        for line in f:
            word, embedding = process_glove_line(line, dim)
            if embedding is not None:
                model[word] = embedding
    return model

model = load_glove_model("glove.840B.300d.txt", 300)
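
As a quick check of the idea, the sketch below applies the same count-from-the-right split to a made-up line with a three-token word and a 3-dimensional vector. The helper name split_from_right and the sample values are hypothetical, and the vector is left as a plain list of floats to keep the example free of dependencies.

```python
def split_from_right(line, dim):
    """Treat the last `dim` tokens as the vector and the rest as the word."""
    tokens = line.split()
    word = " ".join(tokens[:-dim])
    vector = [float(tok) for tok in tokens[-dim:]]
    return word, vector

# Made-up line: the word itself contains whitespace.
word, vector = split_from_right(". . . 0.1 0.2 0.3\n", dim=3)
print(repr(word))
print(vector)
```

One consequence of this approach is that the word is rebuilt with single plain spaces, so whatever whitespace characters separated its parts in the file are normalized.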
pdhoolia answered Sep 20 '22 16:09