It seems that every line of the file has the format 'word number number ...', so it should be easy to split. But when I split the lines with the script below:
    import numpy as np

    def loadGloveModel(gloveFile):
        print("Loading Glove Model")
        f = open(gloveFile, 'r')
        model = {}
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
        print("Done.", len(model), " words loaded!")
        return model
I load glove.840B.300d.txt but get an error. When I print splitLine, I get
['contact', '[email protected]', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]
or
['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]
Note that this script works fine with the glove.6B.* files.
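One way to find the lines that break the parser is to count the tokens on each line: a well-formed line should split into exactly dim + 1 tokens. The sketch below is only an illustration; the helper name find_malformed_lines and the sample values are made up, and it runs on a small in-memory list rather than the real file.

```python
def find_malformed_lines(lines, dim):
    """Return (line_number, token_count) for lines that do not split into dim + 1 tokens."""
    bad = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if len(tokens) != dim + 1:
            bad.append((number, len(tokens)))
    return bad


# Tiny in-memory sample with made-up values and dim = 3.
sample = [
    "the 0.1 0.2 0.3\n",
    ". . . 0.4 0.5 0.6\n",  # the "word" itself contains whitespace
]
malformed = find_malformed_lines(sample, dim=3)
print(malformed)
```

For the real file, the same function can be given the open file object instead of a list, for example one opened with encoding="utf8".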
To load the pre-trained vectors, we must first create a dictionary that will hold the mappings between words, and the embedding vectors of those words. Assuming that your Python file is in the same directory as the GloVe vectors, we can now open the text file containing the embeddings with: with open("glove. 6B.
GloVe is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. It was developed as an open-source project at Stanford and launched in 2014.
The code works fine for the files glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt. This is because glove.840B.300d.txt contains words with whitespace inside them. For example, it has a word like '. . .', with spaces between the dots. split() with no argument splits on any whitespace character, whereas split(' ') splits only on the plain space that separates the fields. I solved this problem by changing this line:
    splitLine = line.split()
into
    splitLine = line.split(' ')
So your code should look like this:
    import numpy as np

    def loadGloveModel(gloveFile):
        print("Loading Glove Model")
        model = {}
        with open(gloveFile, 'r', encoding='utf8') as f:
            for line in f:
                # Split only on the plain space that separates the fields.
                splitLine = line.rstrip('\n').split(' ')
                word = splitLine[0]
                embedding = np.asarray(splitLine[1:], dtype='float32')
                model[word] = embedding
        print("Done.", len(model), " words loaded!")
        return model
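To see why the two calls can behave differently, here is a small sketch. It assumes the whitespace inside the word is a non-ASCII character such as the non-breaking space U+00A0; that is a guess for illustration, not something checked against the actual file, and the line itself is made up.

```python
# Made-up line: the word is three dots joined by non-breaking spaces (U+00A0),
# followed by a 3-dimensional vector separated by plain spaces.
line = ".\u00a0.\u00a0. 0.4 0.5 0.6\n"

# No argument: splits on every Unicode whitespace character, including U+00A0.
default_tokens = line.split()

# Explicit ' ': splits only on the plain ASCII space.
space_tokens = line.rstrip("\n").split(" ")

print(len(default_tokens))    # the word is broken into three tokens
print(len(space_tokens))      # the word stays in one piece
print(repr(space_tokens[0]))
```

If the word contained plain ASCII spaces instead, split(' ') would break it apart as well, so this fix depends on which whitespace character the file actually uses.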
I think the following may help:
    import numpy as np

    def process_glove_line(line, dim):
        word = None
        embedding = None
        try:
            splitLine = line.split()
            # The last `dim` tokens are the vector; everything before them is the word.
            word = " ".join(splitLine[:len(splitLine)-dim])
            embedding = np.array([float(val) for val in splitLine[-dim:]])
        except ValueError:
            # Report lines whose trailing tokens are not all numbers.
            print(line)
        return word, embedding

    def load_glove_model(glove_filepath, dim):
        model = {}
        with open(glove_filepath, encoding="utf8") as f:
            for line in f:
                word, embedding = process_glove_line(line, dim)
                if embedding is not None:
                    model[word] = embedding
        return model

    model = load_glove_model("glove.840B.300d.txt", 300)
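The same idea of counting from the right can also be written with rsplit, which keeps the word exactly as it appears in the file, provided the fields are separated by single plain spaces. This is a minimal sketch under that assumption; parse_from_right is a made-up name, the sample line is invented, and it returns a plain list of floats so it runs without NumPy.

```python
def parse_from_right(line, dim):
    """Split off the last `dim` fields as the vector; whatever remains is the word."""
    parts = line.rstrip("\n").rsplit(" ", dim)
    word = parts[0]
    vector = [float(value) for value in parts[1:]]
    return word, vector


# Made-up line with dim = 3; the word contains plain spaces.
word, vector = parse_from_right(". . . 0.4 0.5 0.6\n", dim=3)
print(repr(word))
print(vector)
```

The list can be wrapped in np.asarray(vector, dtype='float32') if an array is needed. Unlike the split(' ') fix, this does not depend on which whitespace character appears inside the word.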