 

How to convert predicted sequence back to text in keras?

I have a sequence-to-sequence learning model which works fine and is able to predict some outputs. The problem is that I have no idea how to convert the output back to a text sequence.

This is my code.

from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense

txt1 = """What makes this problem difficult is that the sequences can vary in length,
be comprised of a very large vocabulary of input symbols and may require the model
to learn the long term context or dependencies between symbols in the input sequence."""

# txt1 is used for fitting
tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt1)

# convert text to sequence
t = tk.texts_to_sequences(txt1)

# padding to feed the sequence to the keras model
t = pad_sequences(t, maxlen=10)

model = Sequential()
model.add(Dense(10, input_dim=10))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# predicting a new sequence
pred = model.predict(t)

# Convert predicted sequence to text
pred = ??
Eka asked Feb 01 '17

People also ask

How do I encode text in keras?

Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step. The name suggests that it will create a one-hot encoding of the document, which is not the case. Instead, the function is a wrapper for the hashing_trick() function described in the next section.
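To make the hashing behavior concrete, here is a minimal sketch (no keras needed) of what the hashing trick does: each word is mapped to an integer bucket via a stable hash. The function name and vocabulary size are illustrative assumptions, not keras API.

```python
import hashlib

def hashing_trick_sketch(text, n):
    # Map each word to an integer in [1, n) via a stable hash --
    # roughly what keras' one_hot()/hashing_trick() compute.
    # md5 is used because Python's built-in hash() is salted per process.
    words = text.lower().split()
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % (n - 1) + 1
            for w in words]

encoded = hashing_trick_sketch("The quick brown fox", 50)
print(encoded)  # four integers in [1, 49]; hash collisions are possible
```

Because it is a hash, two different words can collide into the same integer, which is why this encoding is not reversible the way a fitted Tokenizer is.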

What is Word_index?

The word_index assigns a unique integer index to each word present in the text. This integer encoding is what the model works with during training. For example: print("The word index", t.word_index)
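The mapping itself can be sketched without keras: index words by descending frequency, starting at 1 (0 is reserved for padding). This is an approximation of what Tokenizer.fit_on_texts builds as word_index, not the exact keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    # Count words across all texts, then assign indices 1..N,
    # most frequent word first (index 0 is left for padding).
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

word_index = build_word_index(["the cat sat", "the cat ran"])
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4}
```

Inverting this dict (index -> word) is exactly what is needed later to turn predicted integer sequences back into text.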

Does keras Tokenizer remove punctuation?

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens.
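A sketch of that default behavior in plain Python, assuming keras' default filter string (note the apostrophe is not in it, so contractions survive):

```python
def text_to_word_sequence_sketch(text):
    # Replace each character in the default punctuation set with a space,
    # lowercase, then split on whitespace -- a sketch of keras' default
    # tokenization step.
    filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    table = str.maketrans(filters, ' ' * len(filters))
    return text.lower().translate(table).split()

print(text_to_word_sequence_sketch("Don't panic, it's fine!"))
# ["don't", 'panic', "it's", 'fine']
```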


1 Answer

You can directly use the inverse function, tokenizer.sequences_to_texts.

text = tokenizer.sequences_to_texts(<list-of-integer-equivalent-encodings>) 

I have tested the above and it works as expected.

P.S.: Take extra care to pass the list of integer encodings, not the one-hot ones.
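Since a softmax layer outputs one-hot-like probability rows rather than integers, the conversion back needs an argmax step first. A minimal sketch, using a hypothetical tiny vocabulary in place of tokenizer.index_word and a fake prediction array standing in for model.predict output:

```python
import numpy as np

# Hypothetical index -> word mapping; in practice use tokenizer.index_word
index_word = {1: 'what', 2: 'makes', 3: 'this', 4: 'problem'}

# Fake softmax output of shape (timesteps, vocab_size)
pred = np.array([[0.10, 0.70, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.10, 0.70]])

# argmax collapses each probability row back to an integer encoding
seq = np.argmax(pred, axis=-1)  # array([1, 4])

# map integers back to words, skipping 0 (the padding index)
text = ' '.join(index_word.get(i, '') for i in seq if i != 0)
print(text)  # 'what problem'
```

Passing pred itself to sequences_to_texts would fail or give nonsense; the argmax-ed integer sequence is the argument the function expects.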

Jairo Alves answered Oct 09 '22