
Keras Denoising Autoencoder (tabular data)

I have a project where I am doing a regression with Gradient Boosted Trees using tabular data. I want to see if using a denoising autoencoder on my data can find a better representation of my original data and improve my original GBT scores. Inspiration is taken from the popular Kaggle winner here.

AFAIK I have two main choices for extracting the activations of the DAE: creating a bottleneck structure and taking the single middle layer's activations, or concatenating every layer's activations as the representation.

Let's assume I want all layer activations from the 3x 512 node layers below:

from keras.layers import Input, Dense
from keras.models import Model
from keras.callbacks import ReduceLROnPlateau

inputs = Input(shape=(31,))
encoded = Dense(512, activation='relu')(inputs)
encoded = Dense(512, activation='relu')(encoded)
decoded = Dense(512, activation='relu')(encoded)
decoded = Dense(31, activation='linear')(decoded)
autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# assumed definition of the reduce_lr callback used below
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)

history = autoencoder.fit(x_train_noisy, x_train_clean,
                          epochs=100,
                          batch_size=128,
                          shuffle=True,
                          validation_data=(x_test_noisy, x_test_clean),
                          callbacks=[reduce_lr])

My questions are:

  • Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.

  • How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?

  • Do I actually need to provide validation_data= to .fit in this situation?

asked Apr 24 '18 by swifty

2 Answers

Taking the activations of the above will give me a new representation of x_train, right? Should I repeat this process for x_test? I need both to train my GBT model.

Of course. You need the denoised representation for both the training and the test data, because the GBT model you train afterwards only accepts the denoised features.

How can I do inference? Each new data point will need to be "converted" into this new representation format. How can I do that with Keras?

If you want to use the denoised/reconstructed features, you can directly call autoencoder.predict(X_feat) to extract them. If you want to use the middle layer, you first need to build a new model, encoder_only = Model(inputs, encoded), and use that for feature extraction. If you want the concatenation of all hidden layers' activations instead, build a model with multiple outputs, as sketched below.
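A minimal sketch of the multi-output variant, assuming the autoencoder defined in the question (the name feature_extractor is mine, and feeding the original, non-noisy features at extraction time is an assumption of this sketch):

import numpy as np
from keras.models import Model

# layers[0] is the Input; layers[1]-[3] are the three 512-unit Dense layers
hidden_outputs = [autoencoder.layers[i].output for i in (1, 2, 3)]
feature_extractor = Model(autoencoder.input, hidden_outputs)

# apply the same transformation to train, test and any future data point
train_feats = np.concatenate(feature_extractor.predict(x_train_clean), axis=1)
test_feats = np.concatenate(feature_extractor.predict(x_test_clean), axis=1)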

Do I actually need to provide validation_data= to .fit in this situation?

It is good practice to hold out some of the training data for validation to catch overfitting. However, you can always train multiple models, e.g. in a leave-one-out fashion, to fully use all the data in an ensemble, as sketched below.
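For example, a k-fold variant of that idea with scikit-learn's KFold (build_autoencoder is a hypothetical helper that recreates and compiles the model from the question):

from sklearn.model_selection import KFold

models = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True).split(x_train_noisy):
    m = build_autoencoder()  # hypothetical helper, not defined in the answers
    m.fit(x_train_noisy[train_idx], x_train_clean[train_idx],
          epochs=100, batch_size=128, shuffle=True,
          validation_data=(x_train_noisy[val_idx], x_train_clean[val_idx]))
    models.append(m)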

Additional remarks:

  • 512 hidden neurons seems too many for a 31-feature input
  • consider using Dropout
  • be careful with tabular data, especially when different columns have very different dynamic ranges (i.e. MSE does not weight the reconstruction errors of different columns fairly; see the sketch below)
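On the last point, standardizing each column before training is one way to make the MSE treat all columns comparably. A sketch with scikit-learn, assuming the data are NumPy arrays:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train_clean)      # fit on training data only
x_train_scaled = scaler.transform(x_train_clean)  # per-column zero mean, unit variance
x_test_scaled = scaler.transform(x_test_clean)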
answered Oct 06 '22 by pitfall


A denoising autoencoder is a model that learns to remove noise from its data: it is trained with the noisy data as input and the corresponding clean data as the target.

The model you are describing above is not a typical autoencoder architecture. In the encoder part, the number of units should gradually decrease from layer to layer, and in the decoder part it should gradually increase again.

A simple denoising autoencoder should look like this:

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(31,))
encoded = Dense(128, activation='relu')(inputs)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)   # bottleneck

decoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(decoded)
decoded = Dense(128, activation='relu')(decoded)
# sigmoid assumes inputs scaled to [0, 1]; use 'linear' otherwise
decoded = Dense(31, activation='sigmoid')(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# noisy inputs, clean targets (as in the question)
autoencoder.fit(x_train_noisy, x_train_clean,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test_noisy, x_test_clean))
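To use the 32-dimensional bottleneck of this model as the new representation for the GBT, the same trick from the first answer applies (a sketch reusing the tensors defined above):

encoder = Model(inputs, encoded)           # 'encoded' is the 32-unit bottleneck
x_train_repr = encoder.predict(x_train_clean)
x_test_repr = encoder.predict(x_test_clean)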
answered Oct 05 '22 by Ioannis Nasios