Problem Statement:
Classify a text document into the category it belongs to, and also predict up to two further levels of that category (Level1 and Level2).
Sample Training Set:
Description                                          Category          Level1     Level2
The gun shooting that happened in Vegas killed two   Crime | High      Crime      High
Donald Trump elected as President of America         Politics | High   Politics   High
Rian won in football qualifier                       Sports | Low      Sports     Low
Brazil won in football final                         Sports | High     Sports     High
Initial Attempt:
I created a classifier model that predicts the Category using a random forest, and it gave me about 90% overall accuracy.
Code1:
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
data_file = "Training_dataset_70k"

# Read the tab-separated dataset
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()

# Remove stopwords, then strip punctuation and collapse repeated whitespace
data['Description'] = data['Description'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
data['Description'] = (data['Description']
                       .str.replace(r'[^\w\s]', ' ', regex=True)
                       .str.replace(r'\s+', ' ', regex=True))

train_data, test_data, train_label, test_label = train_test_split(
    data.Description, data.Category, test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=10)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(train_data)
RF.fit(data_features, train_label)

test_data_feature = vectorizer.transform(test_data)
Output_predict = RF.predict(test_data_feature)
print("Overall_Accuracy: " + str(np.mean(Output_predict == test_label)))

# Write input, prediction and actual label, skipping rows that fail to encode
with codecs.open("out_Category.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            continue
Problem:
I want to add two more levels to the model, Level1 and Level2. The reason for adding them is that when I ran classification for Level1 alone I got 96% accuracy. I am stuck at splitting the training and test datasets and at training a model that has three outputs.
Is it possible to create a single model with three outputs, or should I create three separate models? How do I split the train and test data?
Edit1:
import codecs
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
data_file = "Training_dataset_70k"

# Read the tab-separated dataset
data = pd.read_csv(data_file, header=0, delimiter="\t", quoting=3, encoding="utf8")
data = data.dropna()

# Remove stopwords, then strip punctuation and collapse repeated whitespace
data['Description'] = data['Description'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop))
data['Description'] = (data['Description']
                       .str.replace(r'[^\w\s]', ' ', regex=True)
                       .str.replace(r'\s+', ' ', regex=True))

# Pass all three label columns so train_label/test_label have shape (N, 3)
train_data, test_data, train_label, test_label = train_test_split(
    data.Description, data[["Category", "Level1", "Level2"]],
    test_size=0.3, random_state=100)

RF = RandomForestClassifier(n_estimators=2)
vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
data_features = vectorizer.fit_transform(train_data)
print(len(train_data), len(train_label))
print(train_label)
RF.fit(data_features, train_label)

test_data_feature = vectorizer.transform(test_data)
# print(test_data_feature)
Output_predict = RF.predict(test_data_feature)
# Element-wise accuracy averaged over all three output columns
print("BreadCrumb_Accuracy: " + str(np.mean(Output_predict == test_label.values)))

# Iterate over .values: iterating a DataFrame directly yields column names, not rows
with codecs.open("out_bread_crumb.txt", "w", "utf8") as out:
    for inp, pred, act in zip(test_data, Output_predict, test_label.values):
        try:
            out.write("{}\t{}\t{}\n".format(inp, pred, act))
        except UnicodeError:
            continue
Multi-output classification is a type of machine learning in which the model predicts two or more outputs for each sample, rather than the single output of an ordinary classifier. It should not be confused with multi-class classification, where the classes of a single output are mutually exclusive, or with multi-label classification, where each label represents a separate but related task and a sample may receive zero or more labels. Multi-label classification is in fact a special case of the more general multi-output model, and both require algorithms that can predict multiple, mutually non-exclusive outputs. A small sketch of the idea follows.
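For base estimators that do not natively handle multiple outputs, scikit-learn offers sklearn.multioutput.MultiOutputClassifier, which fits one clone of the estimator per output column. A minimal sketch, using a made-up toy dataset purely for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy features and a (n_samples, 2) label array: one column per output
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([["Sports", "High"], ["Crime", "Low"],
              ["Sports", "High"], ["Crime", "Low"]])

# One LogisticRegression is cloned and fitted per label column
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y)
print(clf.predict([[1.0, 1.0]]))   # one prediction per output, e.g. [['Sports' 'High']]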
The scikit-learn Random Forest Classifier natively supports multiple outputs (see this example). Therefore, you do not need to create three separate models.
From the documentation of RandomForestClassifier.fit, the inputs to fit are:

X : array-like or sparse matrix of shape = [n_samples, n_features]
y : array-like, shape = [n_samples] or [n_samples, n_outputs]

Therefore, you need an array y (your labels) of size N x 3 as the input to your RandomForestClassifier. To split your training and test sets, you can do:
train_data, test_data, train_label, test_label = train_test_split(data.Description, data[['Category','Level1','Level2']], test_size=0.3, random_state=100)
Your train_label and test_label should then be arrays of size N x 3 that you can use to fit your model and to compare against your predictions (NB: I have not tested this here, you might need to do some transposes).
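Putting it together, here is a minimal sketch of the multi-output setup. The column names mirror the question's sample training set, the four-row frame is only illustrative, and the per-output accuracies use scikit-learn's accuracy_score:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative frame mirroring the question's sample training set
data = pd.DataFrame({
    "Description": ["The gun shooting that happened in Vegas killed two",
                    "Donald Trump elected as President of America",
                    "Rian won in football qualifier",
                    "Brazil won in football final"],
    "Category": ["Crime | High", "Politics | High", "Sports | Low", "Sports | High"],
    "Level1":   ["Crime", "Politics", "Sports", "Sports"],
    "Level2":   ["High", "High", "Low", "High"],
})

labels = data[["Category", "Level1", "Level2"]]              # y has shape (N, 3)
train_data, test_data, train_label, test_label = train_test_split(
    data.Description, labels, test_size=0.3, random_state=100)

vectorizer = TfidfVectorizer()
RF = RandomForestClassifier(n_estimators=10)
RF.fit(vectorizer.fit_transform(train_data), train_label)    # one multi-output fit

pred = RF.predict(vectorizer.transform(test_data))           # shape (N_test, 3)
for i, col in enumerate(labels.columns):                     # one accuracy per output
    print(col, accuracy_score(test_label[col], pred[:, i]))

Reporting one accuracy per column is usually more informative than the single element-wise mean used in Edit1, since the three outputs can differ considerably in difficulty (Level1 alone already reached 96% while Category sat at 90%).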