In pandas
or numpy
, I can do the following to get one-hot vectors:
>>> import numpy as np
>>> import pandas as pd
>>> x = [0,2,1,4,3]
>>> pd.get_dummies(x).values
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
>>> np.eye(len(set(x)))[x]
array([[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.]])
From text, with gensim
, I can do:
>>> from gensim.corpora import Dictionary
>>> sent1 = 'this is a foo bar sentence .'.split()
>>> sent2 = 'this is another foo bar sentence .'.split()
>>> texts = [sent1, sent2]
>>> vocab = Dictionary(texts)
>>> [[vocab.token2id[word] for word in sent] for sent in texts]
[[3, 4, 0, 6, 1, 2, 5], [3, 4, 7, 6, 1, 2, 5]]
Then I'll have to do the same pd.get_dummies
or np.eyes
to get the one-hot vector but I get an error where there's one dimension missing from my one-hot vector I have 8 unique words but the one-hot vector lengths are only 7:
>>> [pd.get_dummies(sent).values for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0.]]), array([[ 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0.]])]
It seems like it's doing one-hot vector individually as it iterates through each sentence, instead of using the global vocabulary.
Using np.eye
, I do get the right vectors:
>>> [np.eye(len(vocab))[sent] for sent in texts_idx]
[array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]]), array([[ 0., 0., 0., 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0., 1.],
[ 0., 0., 0., 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 1., 0., 0.]])]
Also, currently, I have to do several things from using gensim.corpora.Dictionary
to converting the words to their ids then getting the one-hot vector.
Are there other ways to achieve the same one-hot vector from texts?
Problems with One-Hot EncodingEven with relatively small eight dimensions, our example text requires exponentially large memory space. Most of the matrix is taken up by zeros, so useful data becomes sparse. Imagine we have a vocabulary of 50,000. (There are roughly a million words in English language.)
Vectorization is the process of converting words into numbers is called Vectorization, It is a methodology in NLP to map words or phrases from vocabulary to a corresponding vector of real numbers which is used to find word predictions, similarities etc.
There are various packages that will do all the steps in a single function such as http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html.
Alternatively, if you have your vocabulary and text indexes for each sentence already, you can create a one-hot encoding by preallocating and using smart indexing. In the following text_idx is a list of integers and vocab is a list relating integers indexes to words.
import numpy as np
vocab_size = len(vocab)
text_length = len(text_idx)
one_hot = np.zeros(([vocab_size, text_length])
one_hot[text_idx, np.arange(text_length)] = 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With