Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting one-hot vector from text

In pandas or numpy, I can do the following to get one-hot vectors:

>>> import numpy as np
>>> import pandas as pd
>>> x = [0,2,1,4,3]
>>> pd.get_dummies(x).values
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.]])

>>> np.eye(len(set(x)))[x]
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  1.,  0.]])

From text, with gensim, I can do:

>>> from gensim.corpora import Dictionary
>>> sent1 = 'this is a foo bar sentence .'.split()
>>> sent2 = 'this is another foo bar sentence .'.split()
>>> texts = [sent1, sent2]
>>> vocab = Dictionary(texts)
>>> [[vocab.token2id[word] for word in sent] for sent in texts]
[[3, 4, 0, 6, 1, 2, 5], [3, 4, 7, 6, 1, 2, 5]]

Then I'll have to do the same pd.get_dummies or np.eyes to get the one-hot vector but I get an error where there's one dimension missing from my one-hot vector I have 8 unique words but the one-hot vector lengths are only 7:

>>> [pd.get_dummies(sent).values for sent in texts_idx]
[array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.]]), array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.]])]

It seems like it's doing one-hot vector individually as it iterates through each sentence, instead of using the global vocabulary.

Using np.eye, I do get the right vectors:

>>> [np.eye(len(vocab))[sent] for sent in texts_idx]
[array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]]), array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.]])]

Also, currently, I have to do several things from using gensim.corpora.Dictionary to converting the words to their ids then getting the one-hot vector.

Are there other ways to achieve the same one-hot vector from texts?

like image 402
alvas Avatar asked Apr 21 '16 08:04

alvas


People also ask

What are the limitations of one-hot word embedding?

Problems with One-Hot EncodingEven with relatively small eight dimensions, our example text requires exponentially large memory space. Most of the matrix is taken up by zeros, so useful data becomes sparse. Imagine we have a vocabulary of 50,000. (There are roughly a million words in English language.)

What is a Vectorizer in NLP?

Vectorization is the process of converting words into numbers is called Vectorization, It is a methodology in NLP to map words or phrases from vocabulary to a corresponding vector of real numbers which is used to find word predictions, similarities etc.


1 Answers

There are various packages that will do all the steps in a single function such as http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.

Alternatively, if you have your vocabulary and text indexes for each sentence already, you can create a one-hot encoding by preallocating and using smart indexing. In the following text_idx is a list of integers and vocab is a list relating integers indexes to words.

import numpy as np
vocab_size = len(vocab)
text_length = len(text_idx)
one_hot = np.zeros(([vocab_size, text_length])
one_hot[text_idx, np.arange(text_length)] = 1
like image 185
jengel Avatar answered Oct 18 '22 04:10

jengel