Adding Attention on top of simple LSTM layer in Tensorflow 2.0

I have a simple network of one LSTM and two Dense layers as such:

model = tf.keras.Sequential()
model.add(layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(layers.Dense(20, activation='sigmoid'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error')

It is training on data with 3 inputs (normalized 0 to 1.0) and 1 output (binary) for the purpose of classification. The data is time series data where there is a relation between time steps.

    var1(t)   var2(t)   var3(t)  var4(t)
0  0.448850  0.503847  0.498571      0.0
1  0.450992  0.503480  0.501215      0.0
2  0.451011  0.506655  0.503049      0.0
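For context, train_X here is a 3-D array of shape (samples, timesteps, features). A hypothetical sliding-window reshape from a flat table like the one above might look like this (the window length and random data are illustrative assumptions, not from the question):

```python
import numpy as np

def make_windows(data, targets, timesteps):
    """Slice a 2-D (rows, features) array into overlapping windows.

    Returns X of shape (samples, timesteps, features) and the target
    aligned with the last step of each window.
    """
    X, y = [], []
    for i in range(len(data) - timesteps + 1):
        X.append(data[i:i + timesteps])
        y.append(targets[i + timesteps - 1])
    return np.array(X), np.array(y)

# Hypothetical data: 100 rows of the 3 normalized inputs, binary output.
rng = np.random.default_rng(0)
values = rng.random((100, 3))
labels = (values[:, 0] > 0.5).astype(float)

train_X, train_y = make_windows(values, labels, timesteps=5)
# train_X.shape -> (96, 5, 3), matching (samples, timesteps, features)
```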

The model is trained as such:

history = model.fit(train_X, train_y, epochs=2800, batch_size=40, validation_data=(test_X, test_y), verbose=2, shuffle=False)
model.summary()

Giving the model summary:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 20)                1920      
_________________________________________________________________
dense (Dense)                (None, 20)                420       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 21        
=================================================================
Total params: 2,361
Trainable params: 2,361
Non-trainable params: 0

The model works reasonably well. Now I am trying to replace the Dense(20) layer with an Attention layer. All the examples, tutorials, etc. online (including the TF docs) are for seq2seq models with an embedding layer at the input layer. I understand the seq2seq implementations in TF v1.x but I cannot find any documentation for what I am trying to do. I believe in the new API (v2.0) I need to do something like this:

lstm = layers.LSTM(20, input_shape=(train_X.shape[1], train_X.shape[2]), return_sequences=True)
lstm = tf.keras.layers.Bidirectional(lstm)
attention = layers.Attention() # this does not work

model = tf.keras.Sequential()
model.add(lstm)
model.add(attention)
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='mean_squared_error')

And of course I get the error "Attention layer must be called on a list of inputs, namely [query, value] or [query, value, key]"

I do not understand the solution to this in version (2.0) and for this case (time series data with fixed length input). Any ideas on adding attention to this type of problem is welcome.
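(For reference, the built-in tf.keras.layers.Attention must be called on a list of tensors such as [query, value], which a Sequential model cannot express. A minimal functional-API sketch of self-attention, using the LSTM output as both query and value, might look like the following; the dimensions 10 and 3 are placeholders for train_X.shape[1] and train_X.shape[2], and the pooling choice is an assumption:)

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical dimensions: 10 time steps, 3 features per step.
inputs = tf.keras.Input(shape=(10, 3))
x = layers.LSTM(20, return_sequences=True)(inputs)   # (batch, 10, 20)
# Self-attention: the same sequence serves as query and value.
attn = layers.Attention()([x, x])                    # (batch, 10, 20)
x = layers.GlobalAveragePooling1D()(attn)            # collapse the time axis
outputs = layers.Dense(1, activation='sigmoid')(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mean_squared_error')
```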

asked Nov 21 '19 by greco.roamin


1 Answer

I eventually found two answers to the problem, both from libraries on pypi.org. The first is self-attention, which can be implemented with standalone Keras (the pre-TF 2.0 version) as follows...

    # SeqSelfAttention comes from the keras-self-attention package on PyPI
    import keras
    from keras_self_attention import SeqSelfAttention

    model = keras.models.Sequential()
    model.add(keras.layers.LSTM(cfg.LSTM,
              input_shape=(cfg.TIMESTEPS, cfg.FEATURES),
              return_sequences=True))
    model.add(SeqSelfAttention(attention_width=cfg.ATTNWIDTH,
              attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
              attention_activation='softmax',
              name='Attention'))
    model.add(keras.layers.Dense(cfg.DENSE))
    model.add(keras.layers.Dense(cfg.OUTPUT, activation='sigmoid'))

The second way to do it is a more general solution that works with the Keras integrated into TF 2.0 (tf.keras), as follows...

    # Attention here comes from the attention package on PyPI
    import tensorflow as tf
    from tensorflow.keras import layers
    from attention import Attention

    model = tf.keras.models.Sequential()
    model.add(layers.LSTM(cfg.LSTM,
              input_shape=(cfg.SEQUENCES, train_X.shape[2]),
              return_sequences=True))
    model.add(Attention(name='attention_weight'))
    model.add(layers.Dense(train_Y.shape[2], activation='sigmoid'))

They each behave a little differently and produce very different results. The self-attention library reduces the output from 3-D to 2-D, so when predicting you get one prediction per input vector. The general attention mechanism keeps the data 3-D, so when predicting you only get one prediction per batch; you can work around this by reshaping your prediction data into batches of size 1 if you want a prediction per input vector.

As for results, self-attention did produce superior results to the LSTM alone, but not better than other enhancements such as dropout or additional dense layers. The general attention does not seem to add any benefit to an LSTM model, and in many cases makes things worse, but I'm still investigating.

In any case, it can be done, but so far it's dubious if it should be done.

answered Sep 18 '22 by greco.roamin