I want to write a Python script, using NLTK or whichever library is best, that correctly identifies whether a given sentence is interrogative (a question) or not. I tried regex, but it fails on more complex cases, so I would like to use natural language processing. Can anybody help?
This will probably solve your question.
Here is the code:
import nltk
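nltk.download('nps_chat')
nltk.download('punkt')  # tokenizer data for nltk.word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead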
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
And that should print something like 0.67, which is decent accuracy. If you want to process a string of text through this classifier, try:
print(classifier.classify(dialogue_act_features(line)))
Here line is the string you want to classify. The classifier labels each string as ynQuestion, Statement, etc., so you can pick out the classes you need.
This approach uses Naive Bayes, which in my opinion is the easiest; there are surely many other ways to do this. Hope this helps!
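To see what the classifier actually receives, here is a minimal sketch of the same bag-of-words feature format. It assumes a simplified regex tokenizer as a stand-in for nltk.word_tokenize so that it runs without any downloads; bag_of_words_features is a hypothetical name, not part of NLTK.

```python
import re


def bag_of_words_features(text):
    """Return {'contains(token)': True} for each lowercased token."""
    # Assumption: a simplified stand-in for nltk.word_tokenize that
    # keeps runs of word characters and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return {"contains({})".format(token): True for token in tokens}


features = bag_of_words_features("Is it raining?")
print(features)
```

Note that the question mark becomes a feature of its own, which is one reason this simple representation can pick up questions at all.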
Building on @PolkaDot's answer, I wrote a function that uses NLTK plus some custom code to get better accuracy.
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains({})'.format(word.lower())] = True
    return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
# 10% of the total data
size = int(len(featuresets) * 0.1)
# first 10% for test_set to check the accuracy, and rest 90% after the first 10% for training
train_set, test_set = featuresets[size:], featuresets[:size]
# get the classifier from the training set
classifier = nltk.NaiveBayesClassifier.train(train_set)
# to check the accuracy - 0.67
# print(nltk.classify.accuracy(classifier, test_set))
question_types = ["whQuestion","ynQuestion"]
def is_ques_using_nltk(ques):
    question_type = classifier.classify(dialogue_act_features(ques))
    return question_type in question_types
and then
question_pattern = ["do i", "do you", "what", "who", "is it", "why", "would you", "how", "is there",
                    "are there", "is it so", "is this true", "to know", "is that true", "are we", "am i",
                    "question is", "tell me more", "can i", "can we", "tell me", "can you explain",
                    "question", "answer", "questions", "answers", "ask"]
helping_verbs = ["is","am","can", "are", "do", "does"]
# if the classifier says "not a question", double-check with the custom rules below
def is_question(question):
    question = question.lower().strip()
    if not is_ques_using_nltk(question):
        is_ques = False
        # check if any of the patterns exists in the sentence
        for pattern in question_pattern:
            is_ques = pattern in question
            if is_ques:
                break
        # there could be multiple sentences, so split the text
        sentence_arr = question.split(".")
        for sentence in sentence_arr:
            if len(sentence.strip()):
                # if the sentence ends with ? or starts with any helping verb
                # word_tokenize will strip by default
                first_word = nltk.word_tokenize(sentence)[0]
                if sentence.endswith("?") or first_word in helping_verbs:
                    is_ques = True
                    break
        return is_ques
    else:
        return True
You just need to call is_question to check whether the passed sentence is a question or not.
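The rule-based fallback can also be tried on its own, without the classifier. The sketch below is a variation rather than the code above: it assumes whole-word matching (so "what" does not fire inside "whatever") and splits on sentence-ending punctuation instead of only on "."; looks_like_question and the shortened pattern list are hypothetical.

```python
import re

# Shortened, illustrative subset of the pattern list above.
QUESTION_PATTERNS = ["do you", "what", "why", "how", "is there", "can you explain"]
HELPING_VERBS = {"is", "am", "can", "are", "do", "does"}


def looks_like_question(text):
    """Rule-based check only; no trained classifier involved."""
    text = text.lower().strip()
    # Whole-word match, so "what" does not match inside "whatever".
    for pattern in QUESTION_PATTERNS:
        if re.search(r"\b" + re.escape(pattern) + r"\b", text):
            return True
    # Split after ., ! or ? so each sentence keeps its own end mark.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"\w+", sentence)
        if sentence.endswith("?") or (words and words[0] in HELPING_VERBS):
            return True
    return False


for example in ["Are we there yet", "It rains. Really?", "Whatever works for me."]:
    print(example, "->", looks_like_question(example))
```

Whole-word matching trades a little recall for fewer false positives; plain substring checks such as `"ask" in question` would also match words like "task".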
You can improve on PolkaDot's solution and reach an accuracy of around 86% with a simple gradient boosting model from the sklearn library. That would look something like this:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()
posts_text = [post.text for post in posts]
# split into train and test sets, 80/20, using one cut point so they do not overlap
split = int(len(posts_text) * 0.8)
train_text = posts_text[:split]
test_text = posts_text[split:]
# get TF-IDF features
vectorizer = TfidfVectorizer(ngram_range=(1,3),
                             min_df=0.001,
                             max_df=0.7,
                             analyzer='word')
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)
y = [post.get('class') for post in posts]
y_train = y[:split]
y_test = y[split:]
# fit the gradient boosting classifier on the training set
gb = GradientBoostingClassifier(n_estimators = 400, random_state=0)
# can be improved with cross-validation
gb.fit(X_train, y_train)
predictions_rf = gb.predict(X_test)
# the 86% figure was reported with test slices starting at the 20% mark, which overlapped the training data; expect a different number on this held-out split
print(classification_report(y_test, predictions_rf))
Then you can use the model to make predictions on new data with gb.predict(vectorizer.transform(['new sentence here'])).
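When slicing data into train and test portions by index, as the snippet above does, it helps to take both slices from one cut point so that they cannot overlap. A minimal sketch with made-up data (split_train_test is a hypothetical helper, not part of sklearn):

```python
def split_train_test(items, train_fraction=0.8):
    """Split a sequence into non-overlapping train and test parts."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# Made-up data, only to show that texts and labels stay aligned.
texts = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

train_text, test_text = split_train_test(texts)
y_train, y_test = split_train_test(labels)
print(len(train_text), len(test_text))
```

Applying the same helper to the texts and the labels keeps the two lists aligned, and no item ends up in both the training and the test set.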