
How to add more features in multi-class text classification?

I have a retail dataset with product_description, price, supplier, and category as columns. I used product_description as the only feature:

from sklearn import metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])

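# label encode the target variable
# (fit once on all labels so both splits share the same label-to-integer mapping)
encoder = preprocessing.LabelEncoder()
encoder.fit(df['category'])
train_y = encoder.transform(train_y)
valid_y = encoder.transform(valid_y)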

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# fit on the training split only, so the validation text does not leak into the vocabulary
tfidf_vect.fit(train_x)
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)

# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(valid_y, predictions) # ~20%, very low

Since the accuracy is very low, I want to add the supplier and price as features too. How can I incorporate this in the code?

I have tried other classifiers like LR, SVM, and Random Forest, but they had (almost) the same outcome.

Snow asked Aug 10 '20 09:08






1 Answer

The TF-IDF vectorizer returns a sparse matrix with one row per example and one column per vocabulary term, holding the TF-IDF scores. You can append extra columns to this matrix before feeding it into the classifier.

  • Prepare your additional features as a NumPy array of shape: number of examples × number of features. Categorical columns such as supplier need a numeric encoding first, for example one-hot.

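  • Append them to the TF-IDF matrix column-wise with scipy.sparse.hstack. np.concatenate with axis=1 works only after converting the sparse TF-IDF matrix to a dense array with .toarray().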

  • Fit the classifier as you did before.
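The steps above can be sketched roughly as follows. This is a minimal illustration, not a drop-in solution: the toy DataFrame and its values are made up, the column names are taken from the question, and the choice of OneHotEncoder for supplier and MinMaxScaler for price is an assumption. In real use, fit the vectorizer, encoder, and scaler on the training split only and call transform on the validation split.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the retail dataset (values are invented for illustration).
df = pd.DataFrame({
    "product_description": ["red cotton shirt", "steel kitchen knife",
                            "blue denim jeans", "ceramic dinner plate"],
    "price": [19.99, 34.50, 49.00, 12.25],
    "supplier": ["acme", "cookco", "acme", "cookco"],
})

# Text features: a sparse matrix with one row per example.
tfidf_vect = TfidfVectorizer(max_features=5000)
x_text = tfidf_vect.fit_transform(df["product_description"])

# Categorical feature: one-hot encode the supplier (sparse output by default).
supplier_enc = OneHotEncoder(handle_unknown="ignore")
x_supplier = supplier_enc.fit_transform(df[["supplier"]])

# Numeric feature: scale price to [0, 1] so it stays non-negative,
# which MultinomialNB requires.
price_scaler = MinMaxScaler()
x_price = price_scaler.fit_transform(df[["price"]])

# Stack all feature blocks column-wise into one sparse matrix.
x_all = sparse.hstack(
    [x_text, x_supplier, sparse.csr_matrix(x_price)]
).tocsr()

print(x_all.shape)  # (number of examples, total number of feature columns)
```

The resulting x_all can be passed to classifier.fit in place of xtrain_tfidf.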

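It is usually a good idea to normalize real-valued features such as price. Note that MultinomialNB requires non-negative inputs, so if you keep it, scale to a non-negative range (for example min-max scaling) rather than standardizing to zero mean. Also, you can try different classifiers: Logistic Regression or SVM might do a better job for real-valued features than Naive Bayes.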
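As an alternative to stacking the matrices by hand, scikit-learn's ColumnTransformer can apply a different preprocessing step to each column and concatenate the results. The sketch below swaps in that approach together with Logistic Regression. The toy data is invented, and the specific transformers and max_iter value are assumptions chosen for illustration, not tuned settings.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the retail dataset (values are invented for illustration).
df = pd.DataFrame({
    "product_description": ["red cotton shirt", "steel kitchen knife",
                            "blue denim jeans", "ceramic dinner plate",
                            "white linen shirt", "cast iron pan"],
    "price": [19.99, 34.50, 49.00, 12.25, 24.00, 39.90],
    "supplier": ["acme", "cookco", "acme", "cookco", "acme", "cookco"],
    "category": ["clothing", "kitchen", "clothing",
                 "kitchen", "clothing", "kitchen"],
})

features = ColumnTransformer([
    # The text column is given as a plain string so the vectorizer
    # receives a 1-D sequence of documents.
    ("text", TfidfVectorizer(max_features=5000), "product_description"),
    # The other columns are given as lists so the transformers
    # receive 2-D input.
    ("supplier", OneHotEncoder(handle_unknown="ignore"), ["supplier"]),
    ("price", StandardScaler(), ["price"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df[["product_description", "price", "supplier"]]
y = df["category"]

# In real use, fit on the training split and predict on the validation split.
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Because the preprocessing lives inside the pipeline, calling model.fit on the training split and model.predict on the validation split keeps the vectorizer, encoder, and scaler from seeing validation data during fitting.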

Jindřich answered Oct 10 '22 08:10
