I am working on a text classification problem where I have multiple text features and need to build a model to predict the salary range. Please refer to the sample dataset. Most resources/tutorials deal with feature extraction on only one column and then predicting the target. I am aware of the usual process: text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then applying an algorithm.
In this problem, I have multiple input text features. How do you handle text classification when multiple features are involved? These are the methods I have already tried, but I am not sure whether they are the right ones. Kindly provide your inputs/suggestions.
1) Applied data cleaning on each feature separately, followed by TF-IDF and then logistic regression. Here I tried to see if I could use just one feature for classification.
2) Applied data cleaning on all the columns separately, applied TF-IDF to each feature, and then merged all the feature vectors into a single feature vector. Finally, logistic regression.
3) Applied data cleaning on all the columns separately and merged all the cleaned columns into one feature, 'merged_text'. Then applied TF-IDF on this merged_text, followed by logistic regression. (A rough sketch of methods 2 and 3 is shown below.)
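To make methods 2 and 3 concrete, here is a minimal sketch of what I mean; the DataFrame df, the target column 'salary_range' and the text column names are placeholders, not the actual dataset:

    # Sketch of methods 2 and 3; df, 'salary_range' and the text column
    # names are placeholders for the actual dataset.
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    text_cols = ["job_description", "job_designation", "key_skills"]
    y = df["salary_range"]

    # Method 2: one TF-IDF matrix per column, stacked side by side.
    # (Fitting the vectorizers outside the CV split is a simplification;
    # a strict setup would fit them inside a Pipeline per fold.)
    X_stacked = hstack([TfidfVectorizer().fit_transform(df[c]) for c in text_cols])
    print(cross_val_score(LogisticRegression(max_iter=1000), X_stacked, y, cv=5).mean())

    # Method 3: concatenate the cleaned columns into one string, then TF-IDF.
    merged_text = df[text_cols].agg(" ".join, axis=1)
    X_merged = TfidfVectorizer().fit_transform(merged_text)
    print(cross_val_score(LogisticRegression(max_iter=1000), X_merged, y, cv=5).mean())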
All 3 methods gave me around 35-40% accuracy on the cross-validation and test sets. I was expecting at least 60% accuracy on the test set, which I am not able to reach.
Also, I don't understand how to use 'company_name' and 'experience' together with the text data; there are 2000+ unique values in company_name. Please provide inputs/pointers on how to handle numeric and categorical data in a text classification problem.
A linear Support Vector Machine is widely regarded as one of the best text classification algorithms; in my experiments it reached an accuracy of 79%, a 5% improvement over Naive Bayes.
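As a rough illustration (not the exact setup behind that 79% figure), a linear SVM baseline on TF-IDF features can look like this; texts and labels are placeholders for your merged text column and the salary-range target:

    # Linear SVM baseline on TF-IDF features; texts/labels are placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    svm_clf = make_pipeline(
        TfidfVectorizer(sublinear_tf=True, min_df=5,
                        ngram_range=(1, 2), stop_words="english"),
        LinearSVC())
    print(cross_val_score(svm_clf, texts, labels, cv=5).mean())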
The TF-IDF features are also very high-dimensional and sparse; you can use a dimensionality reduction method to reduce this effect. A possible choice is Latent Semantic Analysis, implemented in sklearn as TruncatedSVD.
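For example, a sketch of an LSA pipeline; the component count is just a typical starting point, not a tuned value:

    # LSA = TruncatedSVD applied to TF-IDF features, then re-normalization.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Normalizer

    lsa_clf = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        TruncatedSVD(n_components=200, random_state=0),
        Normalizer(copy=False),
        LogisticRegression(max_iter=1000))
    # Fit and evaluate it like any other estimator, e.g. lsa_clf.fit(texts, labels).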
Try these things:
Apply text preprocessing on 'job description', 'job designation' and 'key skills': remove all stop words, split the text on punctuation, lowercase all words, then apply TF-IDF or CountVectorizer. Don't forget to scale these features before training the model.
Convert 'experience' into two features, minimum experience and maximum experience, and treat them as discrete numeric features.
Treat company and location as categorical features and create dummy variables/one-hot encodings before training the model.
Try combining job type and key skills and then vectorizing; see if that works better.
Use a Random Forest classifier and tune its hyperparameters (n_estimators, max_depth, max_features) with GridSearchCV. A sketch tying these steps together follows this list.
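Here is that sketch; the column names, the "X-Y yrs" experience format and the DataFrame df are assumptions about your data, so adjust them to the real schema:

    # Sketch of the suggested feature setup; column names and the
    # experience format ("2-5 yrs") are assumptions about the dataset.
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Split 'experience' such as "2-5 yrs" into numeric min/max features.
    exp = df["experience"].str.extract(r"(\d+)\s*-\s*(\d+)").astype(float)
    df["min_experience"], df["max_experience"] = exp[0], exp[1]

    preprocess = ColumnTransformer([
        ("desc",   TfidfVectorizer(stop_words="english"), "job_description"),
        ("title",  TfidfVectorizer(stop_words="english"), "job_designation"),
        ("skills", TfidfVectorizer(stop_words="english"), "key_skills"),
        ("cat",    OneHotEncoder(handle_unknown="ignore"), ["company_name", "location"]),
        ("num",    StandardScaler(), ["min_experience", "max_experience"]),
    ])

    model = Pipeline([("prep", preprocess),
                      ("rf", RandomForestClassifier(random_state=0))])

    param_grid = {"rf__n_estimators": [200, 500],
                  "rf__max_depth": [None, 20, 40],
                  "rf__max_features": ["sqrt", "log2"]}

    search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1)
    search.fit(df, df["salary_range"])
    print(search.best_params_, search.best_score_)

OneHotEncoder(handle_unknown="ignore") keeps the pipeline from failing on companies that appear only in the test split, which matters with 2000+ unique company names.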
Hopefully, these will increase the performance of the model.
Let me know how it performs with these changes.