
Text[Multi-Level] Classification with many outputs

Problem Statement:

Classify a text document into the category it belongs to, and also predict up to two levels of that category.

Sample Training Set:

Description                                          Category         Level1    Level2
The gun shooting that happened in Vegas killed two   Crime | High     Crime     High
Donald Trump elected as President of America         Politics | High  Politics  High
Rian won in football qualifier                       Sports | Low     Sports    Low
Brazil won in football final                         Sports | High    Sports    High

Initial Attempt:

I built a classifier that predicts the Category using a random forest, and it gave me 90% overall accuracy.

Code1:

import codecs
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from stemming.porter2 import stem

from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score

stop = stopwords.words('english')
data_file = "Training_dataset_70k"

#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()

#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)
#data['Description'] = data['Description'].apply(lambda x: ' '.join([stem(word) for word in x.split()]))

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data.Category, test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except:
            continue

Problem:

I want to add two more outputs to the model, Level1 and Level2. The reason for adding them is that when I ran classification for Level1 alone, I got 96% accuracy. I am stuck on how to split the training and test datasets and how to train a model that has three classification outputs.

Is it possible to create one model with three classification outputs, or should I create three models? How should I split the train and test data?

Edit1:

import string
import codecs
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from stemming.porter2 import stem

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from sklearn.model_selection import cross_val_score


stop = stopwords.words('english')

data_file = "Training_dataset_70k"
#Reading the input/ dataset
data = pd.read_csv( data_file, header = 0, delimiter= "\t", quoting = 3, encoding = "utf8")
data = data.dropna()
#Removing stopwords, punctuation and stemming
data['Description'] = data['Description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['Description'] = data['Description'].str.replace(r'[^\w\s]', ' ', regex=True).str.replace(r'\s+', ' ', regex=True)

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data[["Category", "Level1", "Level2"]], test_size=0.3, random_state=100)
RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer( max_features = 40000, ngram_range = ( 1,3 ), sublinear_tf = True )
data_features = vectorizer.fit_transform( train_data )
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label)
test_data_feature = vectorizer.transform(test_data)
#print test_data_feature
Output_predict = RF.predict(test_data_feature)
# Per-output accuracy, in the order Category, Level1, Level2
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values, axis=0)))
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    # Iterating a DataFrame yields column names, so zip over its rows via .values
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except:
            continue
The6thSense asked Aug 25 '17 17:08


1 Answer

The scikit-learn Random Forest Classifier natively supports multiple outputs (see this example). Therefore, you do not need to create three separate models.
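As a minimal sketch of that claim, the snippet below fits a single RandomForestClassifier on a label matrix with three columns. The numeric features and the extra label combinations are invented for illustration; only the label names echo the question's sample.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy numeric features, invented for illustration only.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]

# One row per sample, one column per output: Category, Level1, Level2.
y = [
    ["Crime | High", "Crime", "High"],
    ["Crime | Low", "Crime", "Low"],
    ["Politics | High", "Politics", "High"],
    ["Politics | Low", "Politics", "Low"],
    ["Sports | High", "Sports", "High"],
    ["Sports | Low", "Sports", "Low"],
]

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

pred = clf.predict([[0, 0], [2, 1]])
print(clf.n_outputs_)  # number of outputs seen during fit
print(pred.shape)      # (n_samples, n_outputs)
```

Each row of pred holds one predicted label per output, in the same column order as y.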

From the documentation of RandomForestClassifier.fit, the inputs to the fit function are:

X : array-like or sparse matrix of shape = [n_samples, n_features]

y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Therefore, you need an array y (your labels) of size N x 3 as your input to your RandomForestClassifier. In order to split your training and test set, you can do:

train_data, test_data, train_label,  test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)

Your train_label and test_label should be arrays of size N x 3 that you can use to fit your model and compare your predictions (NB: I have not tested it here, you might need to do some transposes).
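A rough end-to-end sketch of this approach on a tiny stand-in dataset follows. The four extra rows, the small n_estimators, and the per-output and exact-match accuracy calculations are assumptions added for illustration, not part of the original question or a tuned setup.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Tiny stand-in dataset: first four rows from the question, last four invented.
data = pd.DataFrame({
    "Description": [
        "The gun shooting that happened in Vegas killed two",
        "Donald Trump elected as President of America",
        "Rian won in football qualifier",
        "Brazil won in football final",
        "Bank robbery suspect arrested downtown",
        "Senate passes new budget bill",
        "Local team wins friendly match",
        "Minor theft reported at market",
    ],
    "Level1": ["Crime", "Politics", "Sports", "Sports",
               "Crime", "Politics", "Sports", "Crime"],
    "Level2": ["High", "High", "Low", "High", "High", "High", "Low", "Low"],
})
data["Category"] = data["Level1"] + " | " + data["Level2"]

label_cols = ["Category", "Level1", "Level2"]
train_data, test_data, train_label, test_label = train_test_split(
    data["Description"], data[label_cols], test_size=0.25, random_state=100
)

# Fit the vectorizer on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer(sublinear_tf=True)
X_train = vectorizer.fit_transform(train_data)
X_test = vectorizer.transform(test_data)

# One forest, three outputs: y has shape (n_samples, 3).
clf = RandomForestClassifier(n_estimators=10, random_state=100)
clf.fit(X_train, train_label.to_numpy())
pred = clf.predict(X_test)

# Compare predictions with the true labels, column by column.
y_true = test_label.to_numpy()
per_output_acc = (pred == y_true).mean(axis=0)     # one accuracy per output
exact_match = (pred == y_true).all(axis=1).mean()  # all three outputs correct

print(pred.shape)
print(dict(zip(label_cols, per_output_acc)))
print(exact_match)
```

In this sketch the predictions come back with shape (n_samples, 3), the same layout as test_label, so no transpose is needed before comparing them.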

nbeuchat answered Sep 24 '22 22:09