Problem Statement:
Classify a text document into the category it belongs to, and also predict up to two further levels of that category (Level1 and Level2).
Sample Training Set:
Description                                          Category          Level1     Level2
The gun shooting that happened in Vegas killed two   Crime | High      Crime      High
Donald Trump elected as President of America         Politics | High   Politics   High
Rian won in football qualifier                       Sports | Low      Sports     Low
Brazil won in football final                         Sports | High     Sports     High
Initial Attempt:
I created a classifier model that predicts the Category using a random forest, and it gave me about 90% overall accuracy.
Code1:
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
data_file = "Training_dataset_70k"

# Read the tab-separated dataset
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()

# Remove stopwords, then strip punctuation and collapse repeated whitespace
data['Description'] = data['Description'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
data['Description'] = (data['Description']
                       .str.replace(r'[^\w\s]', ' ', regex=True)
                       .str.replace(r'\s+', ' ', regex=True))

train_data, test_data, train_label, test_label = train_test_split(
    data.Description, data.Category, test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(train_data)
RF.fit(data_features, train_label)

test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))

# Write input, prediction and actual label, skipping rows that fail to encode
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            continue
Problem:
I want to add two more levels to the model, Level1 and Level2. The reason for adding them is that when I ran classification for Level1 alone I got 96% accuracy. I am stuck at splitting the training and test datasets and at training a model that has three outputs.
Is it possible to create a single model with three outputs, or should I create three separate models? How do I split the train and test data?
Edit1:
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
data_file = "Training_dataset_70k"

# Read the tab-separated dataset
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()

# Remove stopwords, then strip punctuation and collapse repeated whitespace
data['Description'] = data['Description'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
data['Description'] = (data['Description']
                       .str.replace(r'[^\w\s]', ' ', regex=True)
                       .str.replace(r'\s+', ' ', regex=True))

# Pass all three label columns so train_label/test_label have shape (N, 3)
train_data, test_data, train_label, test_label = train_test_split(
    data.Description, data[["Category", "Level1", "Level2"]],
    test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(train_data)
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label)

test_data_feature = vectorizer.transform(test_data)
# print(test_data_feature)
Output_predict = RF.predict(test_data_feature)
# Element-wise accuracy averaged over all three output columns
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values)))

# Iterate over .values: iterating a DataFrame directly yields column names, not rows
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            continue
Multi-output classification is a type of machine learning in which the model predicts two or more outputs for each sample, rather than the single output of an ordinary classifier. It should not be confused with multi-class classification, where the classes of a single output are mutually exclusive, or with multi-label classification, where each label represents a separate but related task and a sample may receive zero or more labels. Multi-label classification is in fact a special case of the more general multi-output model, and both require algorithms that can predict multiple, mutually non-exclusive outputs. A small sketch of the idea follows.
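For base estimators that do not natively handle multiple outputs, scikit-learn offers sklearn.multioutput.MultiOutputClassifier, which fits one clone of the estimator per output column. A minimal sketch, using a made-up toy dataset purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy features and a (n_samples, 2) label array: one column per output
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([["Sports", "High"], ["Crime", "Low"],
              ["Sports", "High"], ["Crime", "Low"]])

# One LogisticRegression is cloned and fitted per label column
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[1.0, 1.0]]))   # one prediction per output, e.g. [['Sports' 'High']]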
The scikit-learn Random Forest Classifier natively supports multiple outputs (see this example). Therefore, you do not need to create three separate models.
From the documentation of RandomForestClassifier.fit, the inputs to fit are:

X : array-like or sparse matrix of shape = [n_samples, n_features]
y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Therefore, you need an array y (your labels) of size N x 3 as the input to your RandomForestClassifier. To split your training and test sets, you can do:
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)
Your train_label and test_label should then be arrays of size N x 3 that you can use to fit your model and to compare against your predictions (NB: I have not tested this here, you might need to do some transposes).
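Putting it together, here is a minimal sketch of the multi-output setup. The column names mirror the question's sample training set, the four-row frame is only illustrative, and the per-output accuracies use scikit-learn's accuracy_score:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative frame mirroring the question's sample training set
data = pd.DataFrame({
    "Description": ["The gun shooting that happened in Vegas killed two",
                    "Donald Trump elected as President of America",
                    "Rian won in football qualifier",
                    "Brazil won in football final"],
    "Category": ["Crime | High", "Politics | High", "Sports | Low", "Sports | High"],
    "Level1":   ["Crime", "Politics", "Sports", "Sports"],
    "Level2":   ["High", "High", "Low", "High"],
})

labels = data[["Category", "Level1", "Level2"]]              # y has shape (N, 3)
train_data, test_data, train_label, test_label = train_test_split(
    data.Description, labels, test_size=0.3, random_state=100)

vectorizer = TfidfVectorizer()
RF = RandomForestClassifier(n_estimators=10)
RF.fit(vectorizer.fit_transform(train_data), train_label)    # one multi-output fit

pred = RF.predict(vectorizer.transform(test_data))           # shape (N_test, 3)
for i, col in enumerate(labels.columns):                     # one accuracy per output
    print(col, accuracy_score(test_label[col], pred[:, i]))

Reporting one accuracy per column is usually more informative than the single element-wise mean used in Edit1, since the three outputs can differ considerably in difficulty (Level1 alone already reached 96% while Category sat at 90%).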