import pandas as pd
import numpy as np

# Toy data: one week of dates, day names, and random numeric values
rands = np.random.random(7)
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
dates = pd.date_range('2018-01-01', '2018-01-07')
df = pd.DataFrame({'date': dates, 'days': days, 'y': rands})

# One-hot encode the day-of-week column and append the dummy columns
df_days_onehot = pd.get_dummies(df.days)[days]
df[days] = df_days_onehot

# The target is the next day's value of y
df['target'] = df.y.shift(-1)

df.drop('days', axis=1, inplace=True)
df.set_index('date', inplace=True)

# Features: the numeric y plus the seven one-hot day columns; label: next-day target
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
I shared a code example above. My question is: how should I combine the numerical and the categorical variables as inputs for an LSTM?
What should the input vector look like?
I have read about embeddings, but the explanations don't seem sufficient for me, since I want to understand the logic behind all of this.
Something like this...
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# batch_size and look_back are placeholders defined elsewhere
model = Sequential()
model.add(LSTM(64, batch_input_shape=(batch_size, look_back, 1), stateful=True, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(32, stateful=True))
model.add(Dropout(0.3))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Any guidance, link, explanation or help will be appreciated... Have a nice day.
Yes, it is possible to combine categorical and continuous variables. These designs are built into many software packages, such as Design-Expert. Think of the categorical variables as blocks and you can do it. During the analysis you will get a separate equation representing each categorical level.
Binary encoding is a technique for transforming categorical data into numerical data: categories are first encoded as integers and then converted into binary code.
We will be using LabelEncoder() from the sklearn library to convert categorical data to numerical data, using its fit_transform() function.
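As a minimal sketch (assuming the days column built in the question's DataFrame, before it is dropped), integer-encoding with LabelEncoder could look like this:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# fit_transform learns the category-to-integer mapping and applies it in one step
day_codes = le.fit_transform(df['days'])  # e.g. array([3, 1, 5, 6, 4, 0, 2])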
Because neural networks work internally with numeric data, binary data (such as sex, which can be male or female) and categorical data (such as a community, which can be suburban, city or rural) must be encoded in numeric form.
There is a variety of preprocessing that can be considered when dealing with inputs of various ranges (normalization, for example). A one-hot representation is certainly a good way to represent categories.
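For instance, a minimal normalization sketch using scikit-learn's MinMaxScaler on the numeric column from the question (the column name y is taken from the question's DataFrame):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Scale the numeric feature into the [0, 1] range before feeding it to the network
df['y'] = scaler.fit_transform(df[['y']])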
Embeddings are used when there are too many categories, which would make a one-hot encoding very large. They provide a (potentially trainable) vector representation that encodes a given input. You can read more about them in the link below. Embeddings are very commonly used in NLP.
https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
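As a rough sketch (the layer sizes here are arbitrary choices for illustration), an Embedding layer maps integer category codes to small dense vectors that are learned together with the rest of the model:

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# 7 distinct day categories, each mapped to a 3-dimensional learned vector
model.add(Embedding(input_dim=7, output_dim=3, input_length=1))
model.add(Flatten())
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

The inputs to such a layer would be the integer codes (e.g. from LabelEncoder above), not one-hot vectors.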
That aside, you could take advantage of the fact that the Keras functional API supports multiple input layers.
For your specific case, here is a made-up example that might help you get started. I added a few dense hidden layers just to demonstrate the point; it should be self-explanatory.
from keras.models import Model
from keras.layers import Input, Dense, concatenate

X1 = rands            # numeric feature (one float per sample)
X2 = df_days_onehot   # one-hot day-of-week features (7 columns per sample)
Y = np.random.random(7)

float_input = Input(shape=(1,))
one_hot_input = Input(shape=(7,))

# Separate dense branches for each input, merged into a single representation
first_dense = Dense(3)(float_input)
second_dense = Dense(50)(one_hot_input)
merge_one = concatenate([first_dense, second_dense])

dense_inner = Dense(10)(merge_one)
dense_output = Dense(1)(dense_inner)

model = Model(inputs=[float_input, one_hot_input], outputs=dense_output)
model.compile(loss='mean_squared_error',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()
model.fit([X1, X2], Y, epochs=2)
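If you want to carry the same multi-input idea over to the LSTM from your question, here is a minimal sketch; look_back is a placeholder window length, and reshaping your data into (samples, look_back, features) windows is not shown:

from keras.models import Model
from keras.layers import Input, LSTM, Dense, concatenate

look_back = 3  # placeholder window length

float_seq = Input(shape=(look_back, 1))    # numeric sequence input
one_hot_seq = Input(shape=(look_back, 7))  # one-hot day-of-week sequence input

# Concatenate along the feature axis so each timestep carries 1 + 7 = 8 features
merged = concatenate([float_seq, one_hot_seq])
lstm_out = LSTM(32)(merged)
output = Dense(1)(lstm_out)

model = Model(inputs=[float_seq, one_hot_seq], outputs=output)
model.compile(loss='mean_squared_error', optimizer='adam')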