
Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.

In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.

Around 500 of them are the non-TF-IDF features.

The issue is that, on the same test set, the accuracy of the Random Forest with

- only the non-TF-IDF features is 87%

- both the TF-IDF and non-TF-IDF features is 76%

This significant drop in accuracy raises some questions in my mind.

The relevant piece of code of mine with the training of the models is the following:

import os

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

drop_columns = ['labels', 'complete_text_1', 'complete_text_2']

# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values


# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])

vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])


# Convert the general (non-TF-IDF) features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)


# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])


# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)

Personally, I have not seen any bug in my code (this piece above and in general).

The hypothesis which I have formulated to explain this decrease in accuracy is the following.

  1. The number of non-TF-IDF features is only 500 (out of the 130k features in total).
  2. This means there is a good chance that the non-TF-IDF features are not picked that much at each split by the trees of the random forest (e.g. because of max_features etc; see the rough calculation after this list).
  3. So, if the non-TF-IDF features do actually matter, this will create problems because they are not taken into account enough.
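
To make point 2 concrete, here is a hypothetical back-of-the-envelope check, assuming scikit-learn's default max_features (the square root of the number of features for classification):

import math

n_features = 130_000   # total features after selection
n_general = 500        # non-TF-IDF features

# Candidate features examined at each split with max_features="sqrt"
candidates_per_split = int(math.sqrt(n_features))  # ~360

# Expected number of non-TF-IDF candidates at a single split
expected_general = candidates_per_split * n_general / n_features  # ~1.4

# Probability that a split sees no non-TF-IDF candidate at all
p_none = (1 - n_general / n_features) ** candidates_per_split     # ~0.25

print(expected_general, p_none)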

Related to this, when I check the feature importances of the random forest after training it, I see that the importances of the non-TF-IDF features are very, very low (although I am not sure how reliable an indicator the feature importances are, especially with TF-IDF features included).
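
For what it's worth, a minimal sketch of that check, assuming the column order used in the hstack above (the ~500 general features first, then the two TF-IDF blocks):

importances = rf_classifier.feature_importances_
n_general = X_train.shape[1]  # number of non-TF-IDF columns

# Share of the total importance mass assigned to each feature group
print("non-TF-IDF importance share:", importances[:n_general].sum())
print("TF-IDF importance share:    ", importances[n_general:].sum())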

Can you offer a different explanation for the decrease in accuracy of my classifier?

In any case, what would you suggest doing?

Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.

One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features. The results of these two models would then be combined either by (weighted) voting or by meta-classification (see the sketch below).
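
A rough sketch of that stacking variant (hypothetical names, reusing the arrays from the code above; out-of-fold probabilities are used so the meta-classifier does not train on leaked predictions):

import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# One model per feature group
X_tf_idf = hstack([X_train_tf_idf_1, X_train_tf_idf_2]).tocsr()
rf_general = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')
rf_tf_idf = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')

# Out-of-fold class probabilities as meta-features
p_general = cross_val_predict(rf_general, X_train, y_train, cv=5, method='predict_proba')
p_tf_idf = cross_val_predict(rf_tf_idf, X_tf_idf, y_train, cv=5, method='predict_proba')

# Meta-classifier combines the two models' probabilities
meta_clf = LogisticRegression(max_iter=1000)
meta_clf.fit(np.hstack([p_general, p_tf_idf]), y_train)

# Refit the base models on the full training data for prediction time
rf_general.fit(X_train, y_train)
rf_tf_idf.fit(X_tf_idf, y_train)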

asked Jun 08 '20 by Outcast


1 Answer

Your view that 130K features is way too many for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of the possible next steps. Here are a few ideas off the top of my head.

If the number of datapoints is large enough, you may want to train some transformation for the TF-IDF features - e.g. you might want to train a low-dimensional embedding of these TF-IDF features into, say, a 64-dimensional space and then e.g. a small NN on top of that (even a linear model maybe). Once you have the embeddings, you could use them as a transform to generate 64 additional features for each example, replacing the TF-IDF features for the Random Forest training. Alternatively, you could just replace the whole random forest with a NN with an architecture where e.g. the TF-IDF features are all combined into a few neurons via fully-connected layers and later concatenated with the other features (pretty much the same as embeddings, but as part of the NN).
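
A minimal sketch of that idea, using TruncatedSVD as a simple stand-in for a learned 64-dimensional embedding (a trained NN embedding, as suggested above, would replace this step; variable names reuse those from the question's code):

import numpy as np
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

# Compress the ~129k sparse TF-IDF columns into 64 dense components
svd = TruncatedSVD(n_components=64, random_state=0)
X_tf_idf_64 = svd.fit_transform(hstack([X_train_tf_idf_1, X_train_tf_idf_2]))

# Concatenate with the ~500 general features -> ~564 dense columns in total
X_compact = np.hstack([X_train.toarray(), X_tf_idf_64])

rf_compact = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')
rf_compact.fit(X_compact, y_train)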

If you don't have enough data to train a large NN, maybe you can try to train a GBDT ensemble instead of the random forest. It should probably do a much better job at picking the good features compared to a random forest, which is likely to be affected a lot by many noisy, useless features. Also, you can first train some crude version and then do a feature selection based on that (again, I would expect it to do a more reasonable job than the random forest).
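
A hedged sketch of the GBDT route, using LightGBM as one possible implementation (any gradient-boosted tree library that accepts sparse CSR input would do), with an importance-based feature selection step on top:

import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

X_all = X_train_all.tocsr()

# Crude first model on all ~130k features
gbdt = lgb.LGBMClassifier(n_estimators=300, class_weight='balanced', random_state=0)
gbdt.fit(X_all, y_train)

# Keep only the features whose importance exceeds the median importance
selector = SelectFromModel(gbdt, threshold='median', prefit=True)
X_train_selected = selector.transform(X_all)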

answered Oct 24 '22 by Alexander Pivovarov