I am working on a sentence classification problem and am trying to solve it using Keras. The vocabulary contains 36 unique words in total.
In this case, the total vocab is [W1, W2, W3, ..., W36].
So, if I have a sentence with the words [W1 W2 W6 W7 W9] and I one-hot encode it, I get a numpy array like the one below
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
and the shape is (5,36)
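For reference, a minimal sketch of how such a (5, 36) array could be built with Keras (the 0-based word indices below are hypothetical, chosen only to match the positions of the 1s printed above):
import numpy as np
from keras.utils import to_categorical

vocab_size = 36
word_indices = [35, 6, 31, 2, 11]  # assumed indices of W1, W2, W6, W7, W9 in the vocabulary
sentence = to_categorical(word_indices, num_classes=vocab_size)
print(sentence.shape)  # (5, 36)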
I am stuck from here. All I have generated so far is 20,000 numpy arrays with varying shapes, i.e. (N, 36), where N is the number of words in a sentence. So I have 20,000 sentences for training and 100 for testing, and all the sentences are labelled with a (1, 36) one-hot encoding.
I have x_train, x_test, y_train and y_test
x_test and y_test are of dimension (1,36)
Can anyone please advise how to do this?
I have written some of the code below:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(??????)))  # what should the input shape be?
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
Any help would be much appreciated.
UPDATE and RESPONSE TO @putonspectacles
Thank you very much for the time and effort you put into the detailed response. I tried your code with some minor modifications, which I believe are needed for the code to work. Please find it below.
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM

num_classes = 5
max_words = 20
sentences = ["The cat is in the house","The green boy","computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
# Changed the line below: in the inner for loop, sent -> sent.split()
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
vocab_size = len(words)
print(vocab_size)
# Changed the line below: in the outer for loop, sentences -> sent_ints
X = np.array([to_categorical(pad_sequences((sent,), max_words), vocab_size + 1)
              for sent in sent_ints])
print(X)
print(y)
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X,y)
Without these changes the code doesn't work. When I run the above code, I get the expected one-hot encodings printed, like below
[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]
[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]]
[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]]
[[0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]]
But the error I am getting is "Error when checking input: expected dense_44_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)"
When I change the input shape to model.add(Dense(512, input_shape=(None, max_words, vocab_size + 1))),
I get the error "Input 0 is incompatible with layer lstm_27: expected ndim=3, found ndim=4".
I am working on resolving this issue. If you can give me a direction, that would be great.
I have accepted the answer because it answers the objective of embedding the words. Thanks again.
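A possible direction for the remaining shape error, based only on the reported shape (3, 1, 20, 16): pad_sequences((sent,), max_words) returns a 2-D array of shape (1, 20), so to_categorical yields (1, 20, vocab_size + 1) per sentence, and stacking three of those adds the extra axis. Dropping that singleton axis restores the shape the Dense layer expects, for example:
# Sketch of one possible fix: take the single row that pad_sequences returns
# for each sentence, so the singleton axis disappears.
X = np.array([to_categorical(pad_sequences((sent,), max_words), vocab_size + 1)[0]
              for sent in sent_ints])
print(X.shape)  # (3, 20, 16) -- matches input_shape=(max_words, vocab_size + 1)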
One-hot encoding can be defined as the process of converting categorical data variables into a form that can be provided to machine and deep learning algorithms, which in turn improves the prediction and classification accuracy of a model.
So, in order to include one-hot encoding logic as part of a TensorFlow model, we'll need to create a custom layer that converts string categories into category indices, determines the number of unique categories in our input data, and then uses the tf.one_hot operation to one-hot encode the categorical features.
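A minimal sketch of such a custom layer, assuming TensorFlow 2.x with tf.keras.layers.StringLookup available (the class name OneHotLayer and the example categories are illustrative, not from the original post):
import tensorflow as tf

class OneHotLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # StringLookup maps string categories to integer indices.
        self.lookup = tf.keras.layers.StringLookup()

    def adapt(self, data):
        # Learn the unique categories from the input data.
        self.lookup.adapt(data)

    def call(self, inputs):
        indices = self.lookup(inputs)
        # depth covers all learned categories plus the out-of-vocabulary slot.
        return tf.one_hot(indices, depth=self.lookup.vocabulary_size())

layer = OneHotLayer()
layer.adapt(["red", "green", "blue", "green"])
print(layer(tf.constant(["green", "red"])))  # two one-hot rows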
For example, in the case of a linear regression model (and other regression models that have a bias term), a one-hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra.
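A tiny numpy illustration of that effect, with a made-up design matrix containing a bias column plus a full one-hot encoding of a three-level category:
import numpy as np

# The three one-hot columns sum to the bias column, so the matrix is rank-deficient.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)
print(np.linalg.matrix_rank(A))  # 3, not 4
print(np.linalg.det(A.T @ A))    # ~0: A^T A is singular and cannot be inverted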
Cool, you cleaned up the question. You want to classify a sentence. I am assuming you want to do better than a bag-of-words encoding, i.e. you want to place importance on the sequence.
We'll choose a new model then -- an RNN (the LSTM version). This model effectively sums over the importance of each word (in sequence) as it builds up a representation of the sentence that best fits the task.
But we're going to have to handle the preprocessing a bit differently. For efficiency (so that we can process more sentences together in a batch, as opposed to a single sentence at a time) we want all sentences to have the same number of words. So we choose a max_words, say 20; we pad shorter sentences to reach the max words and we cut sentences longer than 20 words down.
Keras is going to help with that. We'll encode every word with an integer.
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM
num_classes = 5
max_words = 20
sentences = ["The cat is in the house",
"The green boy",
"computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w : i+1 for (i, w) in enumerate(words)}
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)
So "the green boy" might be [1, 3, 5] now. Then we'll pad and one-hot encode with
# pad to max_words length and encode with len(words) + 1
# + 1 because we reserve 0 as the padding sentinel.
X = np.array([to_categorical(pad_sequences((sent,), max_words),
                             vocab_size + 1) for sent in sent_ints])
print(X.shape) # (3, 20, 16)
Now to the model: we'll add a Dense layer to convert those one-hot words to dense vectors. Then we use an LSTM to convert the word vectors in a sentence into a dense sentence vector. Finally, we'll use the softmax activation to produce a probability distribution over the classes.
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
That should compile. You can then carry on with training.
model.fit(X,y)
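For completeness, a possible extension of that call (the epoch count, batch size, and prediction step below are illustrative, not part of the original answer):
model.fit(X, y, epochs=10, batch_size=2)
probs = model.predict(X)      # shape (3, 5): class probabilities per sentence
print(probs.argmax(axis=-1))  # predicted class index for each sentence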
EDIT:
This line:
# We need to split the sentences into words; right now it is reading every
# letter. Notice the sent.split() in the correct version below.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
should be:
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
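A quick way to see the difference the split() makes (illustrative):
print([w for w in "the green boy"])          # characters: ['t', 'h', 'e', ' ', 'g', ...]
print([w for w in "the green boy".split()])  # words: ['the', 'green', 'boy']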