Is it possible to use n-grams in Keras?
E.g., the sentences are contained in an X_train dataframe with a "sentences" column.
I use the Tokenizer from Keras in the following manner:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(lower=True, split=' ')
tokenizer.fit_on_texts(X_train.sentences)
X_train_tokenized = tokenizer.texts_to_sequences(X_train.sentences)
And later I pad the sequences like this:

from keras.preprocessing import sequence

X_train_sequence = sequence.pad_sequences(X_train_tokenized)
Also, I use a simple LSTM network:

from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

model = Sequential()
model.add(Embedding(MAX_FEATURES, 128))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2,
               activation='tanh', return_sequences=True))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model.add(Dense(number_classes, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
In this case, the tokenizer operates on single words. In the Keras docs (https://keras.io/preprocessing/text/) I see that character-level processing is possible, but that is not appropriate for my case.
My main question: can I use n-grams for NLP tasks (not only sentiment analysis, but any NLP task)?
For clarification: I'd like to consider not just single words but combinations of words, and see whether that helps to model my task.
Unfortunately, the Keras Tokenizer() does not support n-grams. You will need a workaround: tokenize the documents into n-grams yourself, and then feed the resulting sequences to the neural network, as in the sketch below.
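Here is a minimal sketch of one such workaround (to_ngram_text is my own helper, not a Keras API): join every n adjacent words with an underscore so the Keras Tokenizer treats each n-gram as a single token. Note that '_' is in the Tokenizer's default filter string, so the filters argument has to be overridden or the joined tokens would be split apart again.

from keras.preprocessing.text import Tokenizer

def to_ngram_text(sentence, n=2):
    # Join every n adjacent words with '_' so each n-gram becomes one "word".
    words = sentence.lower().split()
    return ' '.join('_'.join(words[i:i + n]) for i in range(len(words) - n + 1))

X_train_ngrams = X_train.sentences.apply(to_ngram_text)

# filters='' keeps the '_' intact; the default filter string would strip it.
tokenizer = Tokenizer(lower=True, split=' ', filters='')
tokenizer.fit_on_texts(X_train_ngrams)
X_train_tokenized = tokenizer.texts_to_sequences(X_train_ngrams)

The padded output of texts_to_sequences then drops into the rest of your pipeline unchanged; with n=2 you get bigrams, and you can concatenate each original sentence with its n-gram version if you want unigrams as well.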
In case you are not aware of it, you can also use scikit-learn modules such as CountVectorizer or TfidfVectorizer to generate n-grams, which you can then feed to the network.
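For example, a short sketch with CountVectorizer, where ngram_range=(1, 2) keeps both unigrams and bigrams:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
X_train_counts = vectorizer.fit_transform(X_train.sentences)  # sparse count matrix
print(X_train_counts.shape)  # (num_sentences, num_unigram_and_bigram_features)

Keep in mind that this produces an unordered bag-of-n-grams matrix, so it suits a Dense feed-forward model rather than the Embedding + LSTM stack above; if you want to preserve word order for the LSTM, use the underscore-joining trick instead.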