Extracting one-hot vector from text

In pandas or numpy, I can do the following to get one-hot vectors:

>>> import numpy as np
>>> import pandas as pd
>>> x = [0,2,1,4,3]
>>> pd.get_dummies(x).values
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.]])

>>> np.eye(len(set(x)))[x]
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.]])

From text, with gensim, I can do:

>>> from gensim.corpora import Dictionary
>>> sent1 = 'this is a foo bar sentence .'.split()
>>> sent2 = 'this is another foo bar sentence .'.split()
>>> texts = [sent1, sent2]
>>> vocab = Dictionary(texts)
>>> [[vocab.token2id[word] for word in sent] for sent in texts]
[[3, 4, 0, 6, 1, 2, 5], [3, 4, 7, 6, 1, 2, 5]]

Then I'll have to do the same pd.get_dummies or np.eyes to get the one-hot vector but I get an error where there's one dimension missing from my one-hot vector I have 8 unique words but the one-hot vector lengths are only 7:

>>> [pd.get_dummies(sent).values for sent in texts_idx]
[array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.]]), array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.]])]

It seems like it's doing one-hot vector individually as it iterates through each sentence, instead of using the global vocabulary.

Using np.eye, I do get the right vectors:

>>> [np.eye(len(vocab))[sent] for sent in texts_idx]
[array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]]), array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])]

Also, currently, I have to do several things from using gensim.corpora.Dictionary to converting the words to their ids then getting the one-hot vector.

Are there other ways to achieve the same one-hot vector from texts?

What are the limitations of one-hot word embedding?

Problems with One-Hot EncodingEven with relatively small eight dimensions, our example text requires exponentially large memory space. Most of the matrix is taken up by zeros, so useful data becomes sparse. Imagine we have a vocabulary of 50,000. (There are roughly a million words in English language.)

What is a Vectorizer in NLP?

Vectorization is the process of converting words into numbers is called Vectorization, It is a methodology in NLP to map words or phrases from vocabulary to a corresponding vector of real numbers which is used to find word predictions, similarities etc.

There are various packages that will do all the steps in a single function such as http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.

Alternatively, if you have your vocabulary and text indexes for each sentence already, you can create a one-hot encoding by preallocating and using smart indexing. In the following text_idx is a list of integers and vocab is a list relating integers indexes to words.

import numpy as np
vocab_size = len(vocab)
text_length = len(text_idx)
one_hot = np.zeros(([vocab_size, text_length])
one_hot[text_idx, np.arange(text_length)] = 1

Extracting one-hot vector from text

Tags:

python

pandas

vector

numpy

nlp

alvas

People also ask

1 Answers

jengel

Recent Activity

Donate For Us

Extracting one-hot vector from text

Tags:

python

pandas

vector

numpy

nlp

alvas

People also ask

1 Answers

jengel

Related questions

Recent Activity

Donate For Us