
Using Keras Tokenizer to generate n-grams

Is it possible to use n-grams in Keras?

E.g., the sentences are contained in the X_train DataFrame, in a "sentences" column.

I use tokenizer from Keras in the following manner:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(lower=True, split=' ')
tokenizer.fit_on_texts(X_train.sentences)
X_train_tokenized = tokenizer.texts_to_sequences(X_train.sentences)

And later I pad the sentences thus:

from keras.preprocessing import sequence

X_train_sequence = sequence.pad_sequences(X_train_tokenized)

I also use a simple LSTM network:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(MAX_FEATURES, 128))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2,
               activation='tanh', return_sequences=True))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
# softmax pairs with the categorical_crossentropy loss below
model.add(Dense(number_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])

In this case, tokenization happens at the word level. In the Keras docs (https://keras.io/preprocessing/text/) I see that character-level processing is possible, but that is not appropriate for my case.

My main question: can I use n-grams in Keras for NLP tasks (not only sentiment analysis, but any NLP task)?

For clarification: I'd like to consider not just single words but combinations of words, and try whether that helps to model my task.

asked Sep 12 '17 by Simplex

2 Answers

Unfortunately, Keras' Tokenizer() does not support n-grams. You will need a workaround: tokenize the documents yourself, generating the n-grams, and then feed the resulting sequences to the neural network.
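A minimal sketch of such a workaround (the helper name `add_ngrams` and the example sentences are my own): generate word n-grams up front and append them to each sentence as extra underscore-joined tokens, so the standard Tokenizer then indexes them like ordinary words.

```python
def add_ngrams(sentence, n=2):
    """Return the sentence's words plus joined n-grams (e.g. 'not_good')."""
    words = sentence.lower().split()
    ngrams = ['_'.join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return ' '.join(words + ngrams)

sentences = ["the movie was not good", "the movie was great"]
augmented = [add_ngrams(s) for s in sentences]
# augmented[0] now contains the unigrams plus bigrams such as 'not_good'
```

The augmented strings can then go through tokenizer.fit_on_texts / texts_to_sequences exactly as in the question. Keep in mind the vocabulary grows quickly with n.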

answered Oct 04 '22 by Alex

In case you are not aware of them, you can use scikit-learn's CountVectorizer or TfidfVectorizer to generate n-grams, which you can then feed to the network.

answered Oct 04 '22 by Satheesh K