
How to deal with combination of text and numeric features?

Looking at Kaggle's Job Salary Prediction, I see numeric features (like Category) and textual ones (like FullDescription).

How do I go about training on such data? I thought about vectorizing the text using TfidfTransformer, but it creates a sparse matrix, which many learning algorithms (such as RandomForestRegressor) refuse to work with. Also, once I have the feature vector for the text, how do I combine it with the other features?

Any pointers on how to work with such data?

Thanks!

lazy1 asked May 30 '13 03:05


1 Answer

I would first learn a linear model on the tf-idf features of each text field independently, then add each linear model's predictions as an additional feature alongside the other features, and finally train an ExtraTreesRegressor or GradientBoostingRegressor on the combined features.
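A minimal sketch of that idea with scikit-learn, in case it helps. The toy data, the variable names, the choice of Ridge as the linear model, and the use of out-of-fold predictions via cross_val_predict are all my own assumptions for illustration, not something the approach above prescribes.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one text field, one numeric field, one target.
descriptions = [
    "senior python developer", "junior python developer",
    "senior java engineer", "junior java engineer",
    "senior data scientist", "junior data scientist",
    "senior systems administrator", "junior systems administrator",
]
category = np.array([0, 0, 1, 1, 2, 2, 3, 3], dtype=float)
salary = np.array([90.0, 50.0, 85.0, 45.0, 100.0, 60.0, 70.0, 40.0])

# Step 1: a linear model on the tf-idf features of the text field.
# Ridge accepts the sparse matrix produced by TfidfVectorizer directly.
text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))

# Assumption: use out-of-fold predictions so each row's text feature
# comes from a model that was not fit on that row.
text_pred = cross_val_predict(text_model, descriptions, salary, cv=4)

# Step 2: the linear model's prediction becomes one extra dense column.
X_combined = np.column_stack([category, text_pred])

# Step 3: train the tree ensemble on the small dense matrix.
forest = ExtraTreesRegressor(n_estimators=50, random_state=0)
forest.fit(X_combined, salary)

# For new rows: refit the text model on all training rows, then
# build the same two columns before calling the tree ensemble.
text_model.fit(descriptions, salary)
new_descriptions = ["senior python engineer"]
new_category = np.array([0.0])
X_new = np.column_stack([new_category, text_model.predict(new_descriptions)])
prediction = forest.predict(X_new)
```

With several text fields you would repeat step 1 per field and stack one prediction column for each. The tree model then only ever sees a dense matrix with a handful of columns, which sidesteps the sparse-input problem from the question.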

ogrisel answered Oct 13 '22 03:10