I have a retail dataset with product_description, price, supplier, and category as columns. I used product_description as the only feature:
from sklearn import model_selection, preprocessing, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])

# label encode the target variable (fit on train, reuse the same encoding for validation)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

# build TF-IDF features from the product descriptions
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['product_description'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# train a Multinomial Naive Bayes classifier and predict the labels on the validation set
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(valid_y, predictions)  # ~20%, very low
Since the accuracy is very low, I want to add supplier and price as features too. How can I incorporate these in the code?
I have tried other classifiers like Logistic Regression, SVM, and Random Forest, but they had (almost) the same outcome.
Adding bigrams to the feature set can improve the accuracy of a text classification model. Ideally the model would also distinguish word senses: "book" used as a noun means a book of pages, while "book" used as a verb means to book a ticket or something else.
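As a minimal sketch of the bigram part, you can pass ngram_range=(1, 2) to TfidfVectorizer (train_x and valid_x are the splits from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# include unigrams and bigrams; raise max_features since bigrams enlarge the vocabulary
tfidf_bigram = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=10000)
tfidf_bigram.fit(df['product_description'])
xtrain_tfidf = tfidf_bigram.transform(train_x)
xvalid_tfidf = tfidf_bigram.transform(valid_x)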
Try these things: apply text preprocessing to your text column (product_description here): remove all stop words, tokenize the text and strip punctuation, lowercase all words, then apply TF-IDF or CountVectorizer. Don't forget to scale these features before training the model.
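Most of that preprocessing is built into TfidfVectorizer itself; a minimal sketch (the default tokenizer already discards punctuation, and TF-IDF rows are l2-normalized by default, which covers the scaling step):

from sklearn.feature_extraction.text import TfidfVectorizer

# lowercase the text and drop English stop words; word tokens only, punctuation is discarded
tfidf_vect = TfidfVectorizer(lowercase=True, stop_words='english', analyzer='word', max_features=5000)
xtrain_tfidf = tfidf_vect.fit_transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)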
A linear Support Vector Machine is widely regarded as one of the best text classification algorithms.
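For instance, a minimal sketch with scikit-learn's LinearSVC on the TF-IDF features built in the question:

from sklearn.svm import LinearSVC
from sklearn import metrics

# linear SVM on the same TF-IDF features; train_y/valid_y are the label-encoded targets
svm = LinearSVC()
svm.fit(xtrain_tfidf, train_y)
print(metrics.accuracy_score(valid_y, svm.predict(xvalid_tfidf)))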
XGBoost is a gradient-boosted tree method for supervised learning: given examples with known labels, it learns to predict the labels of new ones. It handles many kinds of data and can be used for text classification too.
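A minimal sketch, assuming the xgboost package is installed; its XGBClassifier follows the scikit-learn estimator API and accepts the sparse TF-IDF matrix directly:

from xgboost import XGBClassifier

# gradient-boosted trees on the TF-IDF features; train_y/valid_y are the label-encoded targets
xgb = XGBClassifier()
xgb.fit(xtrain_tfidf, train_y)
predictions = xgb.predict(xvalid_tfidf)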
The TF-IDF vectorizer returns a (sparse) matrix: one row per example with the feature scores. You can modify this matrix as you wish before feeding it into the classifier. Prepare your additional features as an array of shape (number of examples, number of features), stack the two side by side with scipy.sparse.hstack (np.concatenate with axis=1 also works if you first densify the TF-IDF matrix with .toarray()), then fit the classifier as you did before.
It is usually a good idea to normalize real-valued features such as price. Also, you can try different classifiers: Logistic Regression or an SVM might do a better job with real-valued features than Naive Bayes. A sketch of the whole pipeline follows.
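A minimal sketch of that pipeline, under the assumption that supplier is categorical (one-hot encoded) and price is numeric (min-max scaled so the values stay non-negative, since MultinomialNB rejects negative feature values); variable names follow the question's code:

import numpy as np
from scipy import sparse
from sklearn import naive_bayes, metrics
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# keep the extra columns aligned with the train/validation split
train_idx, valid_idx = train_x.index, valid_x.index

# one-hot encode the categorical supplier column
supplier_enc = OneHotEncoder(handle_unknown='ignore')
train_supplier = supplier_enc.fit_transform(df.loc[train_idx, ['supplier']])
valid_supplier = supplier_enc.transform(df.loc[valid_idx, ['supplier']])

# scale price into [0, 1]
price_scaler = MinMaxScaler()
train_price = price_scaler.fit_transform(df.loc[train_idx, ['price']])
valid_price = price_scaler.transform(df.loc[valid_idx, ['price']])

# stack TF-IDF, supplier, and price side by side
xtrain_all = sparse.hstack([xtrain_tfidf, train_supplier, train_price]).tocsr()
xvalid_all = sparse.hstack([xvalid_tfidf, valid_supplier, valid_price]).tocsr()

# fit the classifier as before
classifier = naive_bayes.MultinomialNB().fit(xtrain_all, train_y)
predictions = classifier.predict(xvalid_all)
print(metrics.accuracy_score(valid_y, predictions))

Note that handle_unknown='ignore' keeps the one-hot transform from failing on suppliers that appear only in the validation split.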