I am trying to build text classifier, Usually, we have one text column and ground truth. But I am working on a problem where dataset contains many text features. I am exploring different ways how to make use of different text features.
For example, my dataset looks like this
Index_no domain comment_by comment research_paper books_name
01 Science Professor Thesis needs Evolution of MOIRCS
more work Quiescent Deep
Galaxies as a Survey
Function of
Stellar Mass
02 Math Professor Doesn't follow Evolution of
Latex format Quiescent nonlinear
Galaxies as a dispersive
Function of equations
Stellar Mass
This is just a dummy dataset, Here my ground truth (Y) is domain and features are comment_by
, comment
, research_paper
, books_name
If I am using any NLP model (RNN-LSTM, Transformers etc), those models usually take one 3 dim vectors, for that if I am using one text column that works but How to many text features for text classifier?
What I've tried :
1) Joining all column and making a long string
Professor Thesis needs more work Evolution of Quiescent Galaxies as a Function of Stellar Mass MOIRCS Deep Survey
2) Using a token between columns
<CB> Professor <C> Thesis needs more work <R> Evolution of Quiescent Galaxies as a Function of Stellar Mass <B> MOIRCS Deep Survey
where <CB>
comment_by , <C>
comment, <R>
research_paper, <B>
books_name
Should I use <CB>
at the beginning or use like this?
Professor <1> Thesis needs more work <2> Evolution of Quiescent Galaxies as a Function of Stellar Mass <3> MOIRCS Deep Survey
3) Using different dense layers (or embedding) for each column and concatenate them.
I've tried all three approaches, Is there any other approach I can try to improve the model accuracy? or extract, combine, join the better features?
Thanks in advance!
Try these things: Apply text preprocessing on 'job description', 'job designation' and 'key skills. Remove all stop words, separate each words removing punctuations, lowercase all words then apply TF-IDF or Count Vectorizer, don't forget to scale these features before training model.
Text classification also known as text tagging or text categorization is the process of categorizing text into organized groups. By using Natural Language Processing (NLP), text classifiers can automatically analyze text and then assign a set of pre-defined tags or categories based on its content.
To combine text features and numerical features follow this: For Numerical Features , use Normalisation or Column Standardization to scale the numerical data. If in case you also want to use Categorical Features, then use OneHotEncoding, LabelEncoding, ResponseCoding etc , to vectorise the Categorical Features.
Here are some of the things you could try:
1.) Combine research_paper
, book_name
and comment
into one string.
2.) Treat comment_by
as a categorical variable and encode it using one hot encoder
or label encoder.
3.) Whatever model you are using, tune the hyperparameters to improve the results.
Do let me know the results you got!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With