I have been looking at autoencoders and have been wondering whether to use tied weights or not. I intend to stack them as a pretraining step and then use their hidden representations to feed an NN.
Using untied weights it would look like:
f(x)=σ2(b2+W2*σ1(b1+W1*x))
Using tied weights it would look like:
f(x)=σ2(b2+W1^T*σ1(b1+W1*x))
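For concreteness, here is a minimal PyTorch sketch of the two parameterizations (the class names, layer sizes, and sigmoid activations are my own illustrative assumptions, not something fixed by the question):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UntiedAutoencoder(nn.Module):
    """f(x) = sigma2(b2 + W2 * sigma1(b1 + W1 * x)) with independent W1 and W2."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)   # W1, b1
        self.dec = nn.Linear(n_hidden, n_in)   # W2, b2

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))         # sigma1(b1 + W1 x)
        return torch.sigmoid(self.dec(h))      # sigma2(b2 + W2 h)

class TiedAutoencoder(nn.Module):
    """f(x) = sigma2(b2 + W1^T * sigma1(b1 + W1 * x)) with a single shared W1."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(n_hidden, n_in) * 0.01)  # shared weight
        self.b1 = nn.Parameter(torch.zeros(n_hidden))               # encoder bias
        self.b2 = nn.Parameter(torch.zeros(n_in))                   # decoder bias

    def encode(self, x):
        return torch.sigmoid(F.linear(x, self.W1, self.b1))         # sigma1(b1 + W1 x)

    def forward(self, x):
        h = self.encode(x)
        # the decoder reuses the encoder weight, transposed: W1^T
        return torch.sigmoid(F.linear(h, self.W1.t(), self.b2))
```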
From a very simplistic view, could one say that tying the weights ensures the encoder is producing the best representation the architecture allows, whereas with independent weights the decoder could effectively take a non-optimal representation and still decode it?
I ask because if the decoder is where the "magic" occurs, and I intend to use only the encoder to drive my NN, wouldn't that be problematic?
Bottleneck: the lower-dimensional hidden layer where the encoding is produced. The bottleneck layer has fewer nodes than the input layer, and its number of nodes gives the dimension of the encoding of the input.
The weight matrix of the decoding stage is the transpose of the weight matrix of the encoding stage, in order to reduce the number of parameters to learn. We want to optimize W, b, and b′ (the shared weight matrix, the encoder bias, and the decoder bias) so that the reconstruction is as similar to the original input as possible with respect to some loss function.
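A rough sketch of that optimization, reusing the hypothetical TiedAutoencoder above (the layer sizes, the MSE loss, and the random mini-batch are placeholder assumptions):

```python
import torch
import torch.nn.functional as F

n_in, n_hidden = 784, 32                  # bottleneck: 32-dimensional encoding of a 784-d input
model = TiedAutoencoder(n_in, n_hidden)   # from the sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(256, n_in)                 # stand-in for a real mini-batch

for step in range(1000):
    opt.zero_grad()
    x_hat = model(x)                      # reconstruction of the input
    loss = F.mse_loss(x_hat, x)           # reconstruction error
    loss.backward()                       # gradients w.r.t. W1, b1, b2
    opt.step()

codes = model.encode(x)                   # 32-d representations to feed the downstream NN
```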
Autoencoders can improve learning accuracy with regularization, which can be a sparsity regularizer, a contractive regularizer [5], or a denoising form of regularization [6]. Recent work [7] has shown that dropout training can be used as a regularizer to prevent feature co-adaptation.
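As an illustration of the denoising form [6], one common recipe (continuing the sketch above; the Gaussian corruption and its 0.3 scale are assumptions) is to corrupt the input but score the reconstruction against the clean input:

```python
# Denoising sketch: feed a corrupted input, compute the loss against the clean input.
noise_std = 0.3                               # corruption level is a hyperparameter
x_noisy = x + noise_std * torch.randn_like(x)
x_hat = model(x_noisy)
loss = F.mse_loss(x_hat, x)                   # reconstruct the *clean* x
```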
A Sparse Autoencoder is a type of autoencoder that employs sparsity to achieve an information bottleneck. Specifically, the loss function is constructed so that activations within a layer are penalized.
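A minimal version of such a penalty (continuing the sketch above; the L1 penalty and its 1e-3 weight are assumptions, and KL-based sparsity penalties are an equally common choice):

```python
# Sparse-autoencoder sketch: add an L1 penalty on the hidden activations
# to the reconstruction loss.
sparsity_weight = 1e-3
h = model.encode(x)                                          # hidden activations
x_hat = torch.sigmoid(F.linear(h, model.W1.t(), model.b2))   # decode h
loss = F.mse_loss(x_hat, x) + sparsity_weight * h.abs().mean()
```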
Autoencoders with tied weights have some important advantages: the decoder reuses the encoder's weights, so there are roughly half as many parameters to learn, which acts as a form of regularization.
But of course they're not perfect: they may not be optimal when your data comes from a highly nonlinear manifold. Depending on the size of your data, I would try both approaches, with tied weights and without, if possible.
UPDATE:
You also asked why a representation that comes from an autoencoder with tied weights might be better than one without. Of course such a representation is not always better, but if the reconstruction error is sensible, then the different units in the coding layer represent something that can be thought of as generators of roughly orthogonal features explaining most of the variance in the data (much like PCA does). That is why such a representation can be quite useful in the subsequent phase of learning.