
Don't Understand how to Implement Embeddings for Categorical Features

From the various examples I've found online, I still don't quite understand how to create embedding layers from my categorical data for neural network models, especially when I have a mix of numerical and categorical data. For example, take the data set below:

import numpy as np
import pandas as pd

numerical_df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=['num_1','num_2','num_3'])

cat_df = pd.DataFrame(np.random.randint(0,5,size=(100, 3)), columns=['cat_1','cat_2','cat_3'])

df = numerical_df.join(cat_df)

I want to create embedding layers for my categorical data and use them alongside my numerical data, but in all the examples I've seen it's almost as if the model just passes the entire dataset through the embedding layer, which is confusing.

To illustrate my confusion, here is an example from Keras' documentation on sequential models. It looks as though they just add the embedding step as the first layer and fit it to the entirety of x_train.

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Embedding
from keras.layers import LSTM

max_features = 1024

model = Sequential()
model.add(Embedding(max_features, output_dim=256))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(x_train, y_train, batch_size=16, epochs=10)
score = model.evaluate(x_test, y_test, batch_size=16)

So ultimately, when it comes to creating embedding matrices, is there one per categorical variable, or one for all categorical variables? And how do I reconcile this with my other data that doesn't need an embedding matrix?

trystuff asked Sep 24 '18 19:09



1 Answer

To combine the categorical data with the numerical data, your model should use multiple inputs via the functional API: one for each categorical variable and one for the numerical inputs. It's up to you how you then combine all that data, but I assume it makes sense to just concatenate everything and then continue with the rest of your model.
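Conceptually, an embedding is just a trainable lookup table, so "one per categorical variable" means one small matrix per column. The NumPy sketch below only illustrates the shapes involved; it is not Keras code, and the embedding width of 2 and the random (untrained) tables are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_categories, embed_dim = 100, 5, 2  # embed_dim=2 is an arbitrary choice

# Stand-ins for numerical_df and cat_df from the question.
numerical = rng.integers(0, 100, size=(n_rows, 3)).astype(float)
categorical = rng.integers(0, n_categories, size=(n_rows, 3))

# One lookup table per categorical column: row k holds the vector for category k.
# In a real model these values would be learned; here they are random.
tables = [rng.normal(size=(n_categories, embed_dim)) for _ in range(3)]

# "Embedding" a column is plain integer indexing into its table.
embedded = [tables[j][categorical[:, j]] for j in range(3)]  # each (100, 2)

# The numerical columns bypass the tables and are concatenated as they are.
features = np.concatenate([numerical] + embedded, axis=1)
print(features.shape)  # (100, 9): 3 numerical + 3 * 2 embedded
```

Only the categorical columns ever touch an embedding table; the rest of the network sees the concatenated `features` array.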

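from keras.layers import Input, Embedding, Flatten, concatenate
from keras.models import Model

# Shortcut: the three categorical columns share one input and one embedding table
# (possible here because each column takes values 0-4).
numerical_in = Input(shape=(3,))
cat_in       = Input(shape=(3,))
embed_layer  = Embedding(input_dim=5, output_dim=3)(cat_in)  # -> (batch, 3, 3)
embed_layer  = Flatten()(embed_layer)                        # -> (batch, 9)
merged_layer = concatenate([numerical_in, embed_layer])      # -> (batch, 12)
output       = rest_of_your_model(merged_layer)  # placeholder for your own layers
model        = Model(inputs=[numerical_in, cat_in], outputs=[output])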

...

model.fit(x=[numerical_df, cat_df], y=[your_expected_out])
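
If you would rather give each categorical variable its own input and its own embedding, a minimal sketch could look like the one below. The binary target `y`, the embedding width of 2, and the two `Dense` layers are placeholders invented for the illustration, not part of the question; swap in your own target and layers.

```python
import numpy as np
import pandas as pd
from keras.layers import Input, Embedding, Flatten, Dense, concatenate
from keras.models import Model

# Same toy data as in the question, plus a made-up binary target.
numerical_df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                            columns=['num_1', 'num_2', 'num_3'])
cat_df = pd.DataFrame(np.random.randint(0, 5, size=(100, 3)),
                      columns=['cat_1', 'cat_2', 'cat_3'])
y = np.random.randint(0, 2, size=(100,))  # hypothetical target

numerical_in = Input(shape=(3,), name='numerical')

# One input and one embedding table per categorical column.
cat_inputs, cat_embeds = [], []
for col in cat_df.columns:
    n_categories = int(cat_df[col].max()) + 1  # assumes codes 0..n-1
    cat_in = Input(shape=(1,), dtype='int32', name=col)
    emb = Embedding(input_dim=n_categories, output_dim=2)(cat_in)  # (batch, 1, 2)
    cat_inputs.append(cat_in)
    cat_embeds.append(Flatten()(emb))                              # (batch, 2)

# Numerical features skip the embeddings and join at the concatenation.
merged = concatenate([numerical_in] + cat_embeds)
hidden = Dense(16, activation='relu')(merged)      # placeholder layers
output = Dense(1, activation='sigmoid')(hidden)

model = Model(inputs=[numerical_in] + cat_inputs, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

# Inputs are passed in the same order as in Model(inputs=...).
x = [numerical_df.values.astype('float32')] + \
    [cat_df[[col]].values.astype('int32') for col in cat_df.columns]
model.fit(x, y, batch_size=16, epochs=1, verbose=0)
```

Separate tables let each variable have its own number of categories and its own embedding width, at the cost of a few more inputs to keep in order.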
KevinH answered Nov 07 '22 19:11
