I have data that looks like this (it can be viewed here and is loaded directly in the code below). In actuality I have ~7000 samples (rows), which are also downloadable.
The task is: given an antigen, predict the corresponding epitope. The epitope is always an exact substring of the antigen, so this is equivalent to sequence-to-sequence learning. My code below is a recurrent neural network in Keras, modeled after the example.
My questions are about which recurrent architecture to use, why the accuracy is so bad, and how to speed up training. Here is my running code, which gives a very bad accuracy score.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function
import sys
import json
import pandas as pd
from keras.models import Sequential
from keras.engine.training import slice_X
from keras.layers.core import Activation, RepeatVector, Dense
from keras.layers import recurrent, TimeDistributed
import numpy as np
from six.moves import range


class CharacterTable(object):
    '''
    Given a set of characters:
    + Encode them to a one-hot integer representation
    + Decode the one-hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    '''
    def __init__(self, chars, maxlen):
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        self.maxlen = maxlen

    def encode(self, C, maxlen=None):
        maxlen = maxlen if maxlen else self.maxlen
        X = np.zeros((maxlen, len(self.chars)))
        for i, c in enumerate(C):
            X[i, self.char_indices[c]] = 1
        return X

    def decode(self, X, calc_argmax=True):
        if calc_argmax:
            X = X.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in X)


class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'


INVERT = True
HIDDEN_SIZE = 128
BATCH_SIZE = 64
LAYERS = 3
# Try replacing LSTM with GRU or SimpleRNN
RNN = recurrent.LSTM


def main():
    """
    Epitope_core = answers
    Antigen = questions
    """
    epi_antigen_df = pd.io.parsers.read_table("http://dpaste.com/2PZ9WH6.txt")
    antigens = epi_antigen_df["Antigen"].tolist()
    epitopes = epi_antigen_df["Epitope Core"].tolist()

    if INVERT:
        antigens = [x[::-1] for x in antigens]

    allchars = "".join(antigens + epitopes)
    allchars = list(set(allchars))
    aa_chars = "".join(allchars)
    sys.stderr.write(aa_chars + "\n")

    max_antigen_len = len(max(antigens, key=len))
    max_epitope_len = len(max(epitopes, key=len))

    X = np.zeros((len(antigens), max_antigen_len, len(aa_chars)), dtype=np.bool)
    y = np.zeros((len(epitopes), max_epitope_len, len(aa_chars)), dtype=np.bool)

    ctable = CharacterTable(aa_chars, max_antigen_len)

    sys.stderr.write("Begin vectorization\n")
    for i, antigen in enumerate(antigens):
        X[i] = ctable.encode(antigen, maxlen=max_antigen_len)
    for i, epitope in enumerate(epitopes):
        y[i] = ctable.encode(epitope, maxlen=max_epitope_len)

    # Shuffle (X, y) in unison
    indices = np.arange(len(y))
    np.random.shuffle(indices)
    X = X[indices]
    y = y[indices]

    # Explicitly set apart 10% for validation data that we never train over
    split_at = len(X) - len(X) // 10
    (X_train, X_val) = (slice_X(X, 0, split_at), slice_X(X, split_at))
    (y_train, y_val) = (y[:split_at], y[split_at:])

    sys.stderr.write("Build model\n")
    model = Sequential()
    # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE
    # note: in a situation where your input sequences have a variable length,
    # use input_shape=(None, nb_feature).
    model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
    # For the decoder's input, we repeat the encoded input for each time step
    model.add(RepeatVector(max_epitope_len))
    # The decoder RNN could be multiple layers stacked or a single layer
    for _ in range(LAYERS):
        model.add(RNN(HIDDEN_SIZE, return_sequences=True))

    # For each step of the output sequence, decide which character should be chosen
    model.add(TimeDistributed(Dense(len(aa_chars))))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    # Train the model each generation and show predictions against the validation dataset
    for iteration in range(1, 200):
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X_train, y_train, batch_size=BATCH_SIZE, nb_epoch=5,
                  validation_data=(X_val, y_val))
        ###
        # Select 10 samples from the validation set at random so we can visualize errors
        for i in range(10):
            ind = np.random.randint(0, len(X_val))
            rowX, rowy = X_val[np.array([ind])], y_val[np.array([ind])]
            preds = model.predict_classes(rowX, verbose=0)
            q = ctable.decode(rowX[0])
            correct = ctable.decode(rowy[0])
            guess = ctable.decode(preds[0], calc_argmax=False)
            # print('Q', q[::-1] if INVERT else q)
            print('T', correct)
            print(colors.ok + '☑' + colors.close if correct == guess else colors.fail + '☒' + colors.close, guess)
            print('---')


if __name__ == '__main__':
    main()
Yes, you can use any of these. LSTMs and GRUs are both types of RNN; if by "RNN" you mean a fully connected (vanilla) RNN, those have fallen out of favor because of the vanishing-gradient problem (1, 2). Given the relatively small number of examples in your dataset, a GRU might be preferable to an LSTM due to its simpler architecture.
You mentioned that training and validation error are both bad. In general, this can be due to one of several factors; one thing you can control directly is how the prediction problem is framed.
One option, since your epitope is a substring of your input, is to predict the start and end indices of the epitope within the antigen sequence (potentially normalized by the length of the antigen sequence) instead of predicting the substring one character at a time. This would be a regression problem with two tasks. For instance, if the antigen is FSKIAGLTVT (10 letters long) and its epitope is KIAGL (positions 3 to 7, one-based), then the input would be FSKIAGLTVT and the outputs would be 0.3 (first task) and 0.7 (second task).
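A minimal sketch of that framing, reusing `RNN`, `Dense`, and the constants from the question's code (the helper `make_positions` and its name are mine, and it assumes the same older Keras API as above):

    def make_positions(antigens, epitopes):
        """Normalized (start, end) position of each epitope within its antigen."""
        y_pos = np.zeros((len(antigens), 2))
        for i, (antigen, epitope) in enumerate(zip(antigens, epitopes)):
            start = antigen.index(epitope)                 # 0-based start of the substring
            end = start + len(epitope)                     # 1-based position of its last residue
            y_pos[i, 0] = float(start + 1) / len(antigen)  # FSKIAGLTVT / KIAGL -> 0.3
            y_pos[i, 1] = float(end) / len(antigen)        #                    -> 0.7
        return y_pos

    # Build targets from the un-reversed antigens (i.e. before the INVERT step);
    # X is the same one-hot antigen tensor built in the question's code.
    y_pos = make_positions(antigens, epitopes)

    model = Sequential()
    model.add(RNN(HIDDEN_SIZE, input_shape=(max_antigen_len, len(aa_chars))))
    model.add(Dense(2))                  # two regression tasks: normalized start and end
    model.add(Activation('sigmoid'))     # keeps both outputs in (0, 1)
    model.compile(loss='mse', optimizer='adam')
    model.fit(X, y_pos, batch_size=BATCH_SIZE, nb_epoch=10)

At prediction time the two outputs can be multiplied back by the antigen length and rounded to recover the predicted substring.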
Alternatively, if you can make all the antigens the same length (by removing samples with short antigens and/or chopping off the ends of long antigens, assuming you know a priori that the epitope is not near the ends), you can frame it as a classification problem with two tasks (start and end) and sequence-length classes, where you assign a probability to the epitope starting and ending at each of the positions.
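A sketch of that version, assuming the functional API of the same Keras release used in the question (needed here because the model has two outputs) and that every antigen has already been trimmed to a common length:

    from keras.models import Model
    from keras.layers import Input

    seq_len = max_antigen_len   # all antigens assumed cut/filtered to this length

    # One-hot position targets: y_start[i, p] = 1 iff sample i's epitope starts at position p.
    y_start = np.zeros((len(antigens), seq_len))
    y_end = np.zeros((len(antigens), seq_len))
    for i, (antigen, epitope) in enumerate(zip(antigens, epitopes)):
        s = antigen.index(epitope)
        y_start[i, s] = 1
        y_end[i, s + len(epitope) - 1] = 1

    seq_input = Input(shape=(seq_len, len(aa_chars)))
    h = RNN(HIDDEN_SIZE)(seq_input)
    start_out = Dense(seq_len, activation='softmax')(h)  # P(epitope starts at position p)
    end_out = Dense(seq_len, activation='softmax')(h)    # P(epitope ends at position p)

    model = Model(seq_input, [start_out, end_out])
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(X, [y_start, y_end], batch_size=BATCH_SIZE, nb_epoch=10)

Taking the argmax of each output head then gives the most likely start and end positions for a new antigen.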
Reducing the number of layers will speed your code up significantly. Also, GRUs will be faster than LSTMs due to their simpler architecture. However, both types of recurrent networks will be slower than, e.g. convolutional networks.
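Both of those changes are single lines against the constants already defined in the question's code (how much they help will depend on your data):

    # A simpler cell type and a single decoder layer; everything else stays the same.
    RNN = recurrent.GRU   # instead of recurrent.LSTM
    LAYERS = 1            # instead of 3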
Feel free to send me an email (address in my profile) if you're interested in a collaboration.