
Accuracy with TF-IDF and non-TF-IDF features

I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.

In total the features are around 130k in number (after a feature selection conducted on the TF-IDF features) and the observations of the training set are around 120k in number.

Around 500 of them are the non-TF-IDF features.

The issue is that, on the same test set, the accuracy of the Random Forest with

- only the non-TF-IDF features is 87%

- both the TF-IDF and non-TF-IDF features is 76%

This significant drop in accuracy raises some questions in my mind.

The relevant piece of code of mine with the training of the models is the following:

import os

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

drop_columns = ['labels', 'complete_text_1', 'complete_text_2']

# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values


# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])

vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])


# Convert the general (non-TF-IDF) features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)


# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])


# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)

Personally, I have not seen any bug in my code (this piece above and in general).

The hypothesis which I have formulated to explain this decrease in accuracy is the following.

  1. The number of non-TF-IDF features is only 500 (out of the 130k features in total).
  2. This means there is a good chance that the non-TF-IDF features are not picked that much at each split by the trees of the random forest (e.g. because of max_features etc; see the rough calculation after this list).
  3. So, if the non-TF-IDF features do actually matter, this will create problems because they are not taken into account enough.
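
To make point 2 concrete, here is a hypothetical back-of-the-envelope check, assuming scikit-learn's default max_features (the square root of the number of features for classification):

import math

n_features = 130_000   # total features after selection
n_general = 500        # non-TF-IDF features

# Candidate features examined at each split with max_features="sqrt"
candidates_per_split = int(math.sqrt(n_features))  # ~360

# Expected number of non-TF-IDF candidates at a single split
expected_general = candidates_per_split * n_general / n_features  # ~1.4

# Probability that a split sees no non-TF-IDF candidate at all
p_none = (1 - n_general / n_features) ** candidates_per_split     # ~0.25

print(expected_general, p_none)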

Related to this, when I check the feature importances of the random forest after training it, I see that the importances of the non-TF-IDF features are very, very low (although I am not sure how reliable an indicator the feature importances are, especially with TF-IDF features included).
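
For what it's worth, a minimal sketch of that check, assuming the column order used in the hstack above (the ~500 general features first, then the two TF-IDF blocks):

importances = rf_classifier.feature_importances_
n_general = X_train.shape[1]  # number of non-TF-IDF columns

# Share of the total importance mass assigned to each feature group
print("non-TF-IDF importance share:", importances[:n_general].sum())
print("TF-IDF importance share:    ", importances[n_general:].sum())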

Can you offer a different explanation for the decrease in accuracy of my classifier?

In any case, what would you suggest doing?

Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.

One option would be to have two separate (random forest) models - one for the TF-IDF features and one for the non-TF-IDF features. The results of these two models would then be combined either by (weighted) voting or by meta-classification (see the sketch below).
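
A rough sketch of that stacking variant (hypothetical names, reusing the arrays from the code above; out-of-fold probabilities are used so the meta-classifier does not train on leaked predictions):

import numpy as np
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# One model per feature group
X_tf_idf = hstack([X_train_tf_idf_1, X_train_tf_idf_2]).tocsr()
rf_general = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')
rf_tf_idf = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')

# Out-of-fold class probabilities as meta-features
p_general = cross_val_predict(rf_general, X_train, y_train, cv=5, method='predict_proba')
p_tf_idf = cross_val_predict(rf_tf_idf, X_tf_idf, y_train, cv=5, method='predict_proba')

# Meta-classifier combines the two models' probabilities
meta_clf = LogisticRegression(max_iter=1000)
meta_clf.fit(np.hstack([p_general, p_tf_idf]), y_train)

# Refit the base models on the full training data for prediction time
rf_general.fit(X_train, y_train)
rf_tf_idf.fit(X_tf_idf, y_train)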

asked Jun 08 '20 by Outcast


1 Answer

Your view that 130K features is way too many for the random forest sounds right. You didn't mention how many examples you have in your dataset, and that would be crucial to the choice of the possible next steps. Here are a few ideas off the top of my head.

If the number of datapoints is large enough, you may want to train some transformation for the TF-IDF features - e.g. you might want to train a low-dimensional embedding of these TF-IDF features into, say, a 64-dimensional space and then e.g. a small NN on top of that (even a linear model maybe). Once you have the embeddings, you could use them as a transform to generate 64 additional features for each example, replacing the TF-IDF features for the Random Forest training. Alternatively, you could just replace the whole random forest with a NN with an architecture where e.g. the TF-IDF features are all combined into a few neurons via fully-connected layers and later concatenated with the other features (pretty much the same as embeddings, but as part of the NN).
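
A minimal sketch of that idea, using TruncatedSVD as a simple stand-in for a learned 64-dimensional embedding (a trained NN embedding, as suggested above, would replace this step; variable names reuse those from the question's code):

import numpy as np
from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier

# Compress the ~129k sparse TF-IDF columns into 64 dense components
svd = TruncatedSVD(n_components=64, random_state=0)
X_tf_idf_64 = svd.fit_transform(hstack([X_train_tf_idf_1, X_train_tf_idf_2]))

# Concatenate with the ~500 general features -> ~564 dense columns in total
X_compact = np.hstack([X_train.toarray(), X_tf_idf_64])

rf_compact = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced')
rf_compact.fit(X_compact, y_train)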

If you don't have enough data to train a large NN, maybe you can try to train a GBDT ensemble instead of the random forest. It should probably do a much better job at picking the good features compared to a random forest, which is likely to be affected a lot by many noisy, useless features. Also, you can first train some crude version and then do a feature selection based on that (again, I would expect it to do a more reasonable job than the random forest).
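
A hedged sketch of the GBDT route, using LightGBM as one possible implementation (any gradient-boosted tree library that accepts sparse CSR input would do), with an importance-based feature selection step on top:

import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

X_all = X_train_all.tocsr()

# Crude first model on all ~130k features
gbdt = lgb.LGBMClassifier(n_estimators=300, class_weight='balanced', random_state=0)
gbdt.fit(X_all, y_train)

# Keep only the features whose importance exceeds the median importance
selector = SelectFromModel(gbdt, threshold='median', prefit=True)
X_train_selected = selector.transform(X_all)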

answered Oct 24 '22 by Alexander Pivovarov