I have a classic NLP problem: I have to classify news articles as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as the subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc.
What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone.
Abstract: The TF-IDF (term frequency-inverse document frequency) algorithm is based on word statistics for text feature extraction. It considers only surface forms of words that are identical across texts (i.e., the same ASCII strings), without considering that the same meaning could be expressed by synonyms.
TF-IDF is better than Count Vectorizers because it not only focuses on the frequency of words present in the corpus but also provides the importance of the words. We can then remove the words that are less important for analysis, hence making the model building less complex by reducing the input dimensions.
The reason we need IDF is to help correct for words like “of”, “as”, “the”, etc. since they appear frequently in an English corpus. Thus by taking inverse document frequency, we can minimize the weighting of frequent terms while making infrequent terms have a higher impact.
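As a concrete illustration (a minimal sketch assuming scikit-learn's TfidfVectorizer, which is not named in the question but is the usual choice), a bigram TF-IDF matrix can be built like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the news articles (hypothetical data)
docs = [
    "the senator denied the report",
    "scientists confirm the report is accurate",
    "shocking secret the media will not tell you",
]

# ngram_range=(2, 2) keeps only bigrams; the IDF factor down-weights
# bigrams that appear in many documents
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
X_tfidf = vectorizer.fit_transform(docs)  # scipy.sparse matrix, shape (n_docs, n_bigrams)

print(X_tfidf.shape)
print(vectorizer.get_feature_names_out()[:5])
```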
Not sure if you're asking technically how to combine two objects in code or what to do theoretically afterwards, so I will try to answer both.
Technically, your TF-IDF output is just a matrix where the rows are records and the columns are features. To combine the two sets, you can append your new features as columns to the end of that matrix. If you built it with sklearn, the matrix is probably a sparse matrix (from SciPy), so you will have to make sure your new features are converted to a sparse matrix as well (or make the TF-IDF matrix dense).
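For example, a minimal sketch of that column-append step (the shapes and arrays here are placeholders, not your real data):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, random as sparse_random

# Stand-ins for the real data (hypothetical shapes):
# a sparse bigram TF-IDF matrix and ~20 dense pattern.en features
X_tfidf = sparse_random(100, 5000, density=0.01, format="csr")
X_extra = np.random.rand(100, 20)

# Keep everything sparse (usually preferable with a large bigram vocabulary)
X_combined = hstack([X_tfidf, csr_matrix(X_extra)], format="csr")

# Or make everything dense (only viable when the vocabulary is small)
# X_combined_dense = np.hstack([X_tfidf.toarray(), X_extra])

print(X_combined.shape)  # (100, 5020)
```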
That gives you your training data. In terms of what to do with it, things are a little trickier. Your features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that you will have a lot of 0s) and essentially binary, whilst your other data is dense and continuous. This will run in most machine learning algorithms as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). Usually I found better results when I could at least bin the dense variables into binary (or categorical and then one-hot encoded into binary) so that they didn't dominate; a sketch of that idea follows.
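One way to realise that binning idea (a sketch only, assuming scikit-learn; KBinsDiscretizer with one-hot output and a random forest are just one reasonable combination, not the only one):

```python
import numpy as np
from scipy.sparse import hstack, random as sparse_random
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the real data (hypothetical shapes)
X_tfidf = sparse_random(200, 5000, density=0.01, format="csr")  # bigram TF-IDF
X_dense = np.random.rand(200, 20)                               # pattern.en features
y = np.random.randint(0, 2, size=200)                           # fake/real labels

# Bin each dense feature into quantile buckets and one-hot encode the bins,
# so they end up on the same mostly-0/1 scale as the term features
binner = KBinsDiscretizer(n_bins=4, encode="onehot", strategy="quantile")
X_dense_binned = binner.fit_transform(X_dense)  # sparse one-hot matrix

X_all = hstack([X_tfidf, X_dense_binned], format="csr")

# Tree ensemble trained on the combined feature matrix
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_all, y)
```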