Trouble loading GloVe 840B 300d vectors

Each line seems to have the format 'word number number ...', so it should be easy to split. But I run into a problem when I split the lines with the script below:

import numpy as np
def loadGloveModel(gloveFile):
    print "Loading Glove Model"
    f = open(gloveFile,'r')
    model = {}
    for line in f:
        splitLine = line.split()
        word = splitLine[0]
        embedding = np.array([float(val) for val in splitLine[1:]])
        model[word] = embedding
    print "Done.",len(model)," words loaded!"
    return model

When I load glove.840B.300d.txt, I get an error. Printing splitLine for the failing lines gives:

['contact', '[email protected]', '0.016426', '0.13728', '0.18781', '0.75784', '0.44012', '0.096794' ... ]

['.', '.', '.', '.', '0.033459', '-0.085658', '0.27155', ...]

Note that this script works fine with the glove.6B.* files.

341

asked Mar 03 '18 11:03

Linjie Xu

2 Answers

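The code works fine for glove.6B.*d.txt and glove.42B.*d.txt, but not for glove.840B.300d.txt, because that file contains words with whitespace inside them. For example, it has a word like '. . .' with spaces between the dots. line.split() with no argument splits on every whitespace character, so such a word is broken into several pieces. Splitting on a single space character avoids this as long as the whitespace inside the word is not itself a plain ASCII space (for example, a non-breaking space). I solved the problem by changing this line: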

splitLine = line.split()

into

splitLine = line.split(' ')

So your code should look like this:

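import numpy as np

def loadGloveModel(gloveFile):
    print("Loading Glove Model")
    model = {}
    # The encoding argument requires Python 3's open(); the file is UTF-8 text.
    with open(gloveFile, 'r', encoding='utf8') as f:
        for line in f:
            # Split on the space character only, so other whitespace inside a word is kept.
            splitLine = line.split(' ')
            word = splitLine[0]
            embedding = np.asarray(splitLine[1:], dtype='float32')
            model[word] = embedding
    print("Done.", len(model), "words loaded!")
    return model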
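
To see the difference between the two calls, here is a minimal sketch on a made-up line. The line is synthetic, not copied from the real file, and it assumes the dots are joined by non-breaking spaces (U+00A0), which may or may not match what glove.840B.300d.txt actually contains.

```python
# Synthetic GloVe-style line (hypothetical): the "word" is three dots
# joined by non-breaking spaces, followed by three made-up vector values.
line = ".\u00a0.\u00a0. 0.033459 -0.085658 0.27155\n"

# split() with no argument treats every Unicode whitespace character
# as a delimiter, so the word is broken into three separate dots.
tokens_any_ws = line.split()

# split(' ') only splits on the ASCII space, so the word stays in one piece.
tokens_space = line.split(" ")

print(tokens_any_ws)
print(repr(tokens_space[0]), len(tokens_space))
```

With split(), the first three tokens are all dots and float() fails on them. With split(' '), the first token is the whole word and the remaining tokens are the numbers.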

189

answered Sep 17 '22 16:09

Weikai

I think the following may help:

import numpy as np

def process_glove_line(line, dim):
    word = None
    embedding = None

    try:
        splitLine = line.split()
        # The last dim fields are the vector; everything before them is the word,
        # which may itself contain whitespace.
        word = " ".join(splitLine[:len(splitLine)-dim])
        embedding = np.array([float(val) for val in splitLine[-dim:]])
    except ValueError:
        # A field that should be numeric could not be parsed; show the line.
        print(line)

    return word, embedding

def load_glove_model(glove_filepath, dim):
    model = {}
    with open(glove_filepath, encoding="utf8") as f:
        # Iterate over the file lazily instead of reading every line into memory first.
        for line in f:
            word, embedding = process_glove_line(line, dim)
            if embedding is not None:
                model[word] = embedding
    return model

model = load_glove_model("glove.840B.300d.txt", 300)
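
The same idea of treating the last dim fields as the vector can also be sketched with str.rsplit, which splits from the right a fixed number of times. This is only an illustration: parse_glove_line is a hypothetical helper name, it assumes the vector fields are separated by single ASCII spaces, and it returns a plain list of floats so that it runs without NumPy.

```python
def parse_glove_line(line, dim):
    """Split one GloVe-style text line into (word, list of dim floats).

    Hypothetical helper: assumes the last `dim` space-separated fields are
    the vector and everything before them is the word.
    """
    # Split at most `dim` times, starting from the right, so any spaces
    # inside the word are left untouched.
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected a word followed by {dim} values")
    word, values = parts[0], parts[1:]
    return word, [float(v) for v in values]


# Made-up 3-dimensional example line, not taken from the real file.
word, vec = parse_glove_line(". . . 0.033459 -0.085658 0.27155\n", 3)
print(repr(word), vec)
```

Unlike joining the leading tokens with " ", this keeps whatever whitespace the word originally contained, so two words that differ only in their internal whitespace do not collapse into the same dictionary key.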

answered Sep 20 '22 16:09

pdhoolia

Tags: python, nlp, stanford-nlp, word2vec
