I have used Keras with pre-trained word embeddings, but I am not quite sure how to do the same with a scikit-learn model.
I need to do this in sklearn as well because I am using vecstack
to ensemble both a Keras sequential model and an sklearn model.
This is what I have done for the Keras model:
glove_dir = '/home/Documents/Glove'

# Build a word -> vector lookup from the GloVe file
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), 'r', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

# Copy the GloVe vector for each vocabulary word into the embedding matrix
embedding_dim = 200
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
.
.
# Load the pre-trained vectors into the Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
model.compile(----)
model.fit(-----)
I am very new to scikit-learn. From what I have seen, to make a model in sklearn you do:

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

So, my question is: how do I use pre-trained GloVe with this model? Where do I pass the pre-trained GloVe embedding_matrix?
Thank you very much and I really appreciate your help.
Pretrained word embeddings are embeddings learned on one task and reused to solve another, similar task. They are trained on large datasets, saved, and then loaded when solving other problems, which is why using pretrained word embeddings is a form of transfer learning.
To load the pre-trained vectors, we must first create a dictionary that holds the mappings between words and their embedding vectors. Assuming your Python file is in the same directory as the GloVe vectors, you can open the text file containing the embeddings with a with open(...) statement, as in the sketch below.
The algorithm for building the embedding matrix: traverse the GloVe file of the chosen dimension and compare each word against the words in your vocabulary; when a match occurs, copy the corresponding GloVe vector into embedding_matrix at that word's index.
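As a concrete illustration of both steps, here is a minimal sketch. It reuses the file name from the question, and the toy word_index below merely stands in for a fitted Keras Tokenizer's vocabulary (an assumption for illustration):

import numpy as np

# Toy vocabulary standing in for word_index from a fitted Keras Tokenizer (assumption)
word_index = {"the": 1, "cat": 2, "sat": 3}
max_words = 10000      # vocabulary cap, as in the question
embedding_dim = 200

# Step 1: build a word -> vector lookup from the GloVe file (same file as in the question)
embeddings_index = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Step 2: row i of embedding_matrix holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]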
You can simply use the Zeugma library. Install it with pip install zeugma, then create and train your model with the following lines of code (assuming corpus_train and corpus_test are lists of strings):
from sklearn.linear_model import LogisticRegression
from zeugma.embeddings import EmbeddingTransformer

# Turn each document into a fixed-length embedding vector
glove = EmbeddingTransformer('glove')
x_train = glove.transform(corpus_train)

model = LogisticRegression()
model.fit(x_train, y_train)

x_test = glove.transform(corpus_test)
model.predict(x_test)
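Here glove.transform maps each document to a single fixed-length vector (by aggregating the embeddings of its words, typically averaging them), which is exactly the kind of input that LogisticRegression and other sklearn estimators expect.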
You can also use different pre-trained embeddings (complete list here) or train your own (see Zeugma's documentation for how to do this).
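If you prefer not to add a dependency, you can achieve much the same thing by hand: average the GloVe vectors of the words in each document to obtain a fixed-length feature vector for sklearn. A minimal sketch, assuming the embeddings_index dictionary from the question and corpus_train / corpus_test / y_train as above (featurize and its naive whitespace tokenization are illustrative, not a library API):

import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(corpus, embeddings_index, dim=200):
    # One row per document: the mean of the GloVe vectors of its words
    # (rows stay all-zero for documents containing no known words).
    features = np.zeros((len(corpus), dim), dtype="float32")
    for row, doc in enumerate(corpus):
        vectors = [embeddings_index[w] for w in doc.lower().split() if w in embeddings_index]
        if vectors:
            features[row] = np.mean(vectors, axis=0)
    return features

# embeddings_index, corpus_train, corpus_test, y_train assumed defined as above
x_train = featurize(corpus_train, embeddings_index)
x_test = featurize(corpus_test, embeddings_index)

model = LogisticRegression()
model.fit(x_train, y_train)
model.predict(x_test)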