import pandas as pd
import numpy as np

# Toy data: one week of dates, day names, and random numeric values
rands = np.random.random(7)
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
dates = pd.date_range('2018-01-01', '2018-01-07')
df = pd.DataFrame({'date': dates, 'days': days, 'y': rands})

# One-hot encode the day-of-week column and append the dummy columns
df_days_onehot = pd.get_dummies(df.days)[days]
df[days] = df_days_onehot

# The target is the next day's value of y
df['target'] = df.y.shift(-1)

df.drop('days', axis=1, inplace=True)
df.set_index('date', inplace=True)

# Features: the numeric y plus the seven one-hot day columns; label: next-day target
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
I shared a code example above. My question is: how should I combine the numerical and the categorical variables as inputs for an LSTM?
What should the input vector look like?
I have read about embeddings, but the explanations don't seem sufficient for me, since I want to understand the logic behind all of this.
Something like this...
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# batch_size and look_back are placeholders defined elsewhere
model = Sequential()
model.add(LSTM(64, batch_input_shape=(batch_size, look_back, 1), stateful=True, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(32, stateful=True))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Any guidance, link, explanation or help will be appreciated... Have a nice day.
Yes, it is possible to combine categorical and continuous variables. These designs are built into many software packages, such as Design-Expert. Think of the categorical variables as blocks and you can do it. During the analysis you will get a separate equation representing each categorical level.
Binary encoding is a technique for transforming categorical data into numerical data: categories are first encoded as integers and then converted into binary code.
We will be using LabelEncoder() from the sklearn library to convert categorical data to numerical data, using its fit_transform() function.
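As a minimal sketch (assuming the days column built in the question's DataFrame, before it is dropped), integer-encoding with LabelEncoder could look like this:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the category-to-integer mapping and applies it in one step
day_codes = le.fit_transform(df['days'])  # e.g. array([3, 1, 5, 6, 4, 0, 2])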
Because neural networks work internally with numeric data, binary data (such as sex, which can be male or female) and categorical data (such as a community, which can be suburban, city or rural) must be encoded in numeric form.
There is a variety of preprocessing that can be considered when dealing with inputs of various ranges (normalization, for example). A one-hot representation is certainly a good way to represent categories.
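For instance, a minimal normalization sketch using scikit-learn's MinMaxScaler on the numeric column from the question (the column name y is taken from the question's DataFrame):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Scale the numeric feature into the [0, 1] range before feeding it to the network
df['y'] = scaler.fit_transform(df[['y']])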
Embeddings are used when there are too many categories, which would make a one-hot encoding very large. They provide a (potentially trainable) vector representation that encodes a given input. You can read more about them in the link below. Embeddings are very commonly used in NLP.
https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
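As a rough sketch (the layer sizes here are arbitrary choices for illustration), an Embedding layer maps integer category codes to small dense vectors that are learned together with the rest of the model:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# 7 distinct day categories, each mapped to a 3-dimensional learned vector
model.add(Embedding(input_dim=7, output_dim=3, input_length=1))
model.add(Flatten())
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

The inputs to such a layer would be the integer codes (e.g. from LabelEncoder above), not one-hot vectors.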
That aside, you could take advantage of the fact that the Keras functional API supports multiple input layers.
For your specific case, here is a made-up example that might help you get started. I added a few dense hidden layers just to demonstrate the point; it should be self-explanatory.
from keras.models import Model
from keras.layers import Input, Dense, concatenate

X1 = rands            # numeric feature (one float per sample)
X2 = df_days_onehot   # one-hot day-of-week features (7 columns per sample)
Y = np.random.random(7)

float_input = Input(shape=(1,))
one_hot_input = Input(shape=(7,))

# Separate dense branches for each input, merged into a single representation
first_dense = Dense(3)(float_input)
second_dense = Dense(50)(one_hot_input)
merge_one = concatenate([first_dense, second_dense])

dense_inner = Dense(10)(merge_one)
dense_output = Dense(1)(dense_inner)

model = Model(inputs=[float_input, one_hot_input], outputs=dense_output)
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
model.fit([X1, X2], Y, epochs=2)
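If you want to carry the same multi-input idea over to the LSTM from your question, here is a minimal sketch; look_back is a placeholder window length, and reshaping your data into (samples, look_back, features) windows is not shown:

from keras.models import Model
from keras.layers import Input, LSTM, Dense, concatenate

look_back = 3  # placeholder window length

float_seq = Input(shape=(look_back, 1))    # numeric sequence input
one_hot_seq = Input(shape=(look_back, 7))  # one-hot day-of-week sequence input

# Concatenate along the feature axis so each timestep carries 1 + 7 = 8 features
merged = concatenate([float_seq, one_hot_seq])
lstm_out = LSTM(32)(merged)
output = Dense(1)(lstm_out)

model = Model(inputs=[float_seq, one_hot_seq], outputs=output)
model.compile(loss='mean_squared_error', optimizer='adam')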