
How to use multiple text features for NLP classifier?

I am trying to build a text classifier. Usually we have one text column and a ground-truth label, but I am working on a problem where the dataset contains many text features. I am exploring different ways to make use of them.

For example, my dataset looks like this

Index_no  domain   comment_by  comment                      research_paper                                                 books_name
01        Science  Professor   Thesis needs more work       Evolution of Quiescent Galaxies as a Function of Stellar Mass  MOIRCS Deep Survey
02        Math     Professor   Doesn't follow Latex format  Evolution of Quiescent Galaxies as a Function of Stellar Mass  nonlinear dispersive equations

This is just a dummy dataset. Here my ground truth (Y) is domain, and the features are comment_by, comment, research_paper, and books_name.

If I use any NLP model (RNN/LSTM, Transformers, etc.), the model usually takes a single 3-dimensional input tensor. That works when I have one text column, but how do I feed multiple text features into a text classifier?

What I've tried:

1) Joining all columns into one long string

Professor Thesis needs more work Evolution of Quiescent Galaxies as a Function of Stellar Mass MOIRCS Deep Survey

2) Using a token between columns

<CB> Professor <C> Thesis needs more work <R> Evolution of Quiescent Galaxies as a Function of Stellar Mass <B> MOIRCS Deep Survey 

where <CB> = comment_by, <C> = comment, <R> = research_paper, <B> = books_name

Should I put a marker such as <CB> at the beginning of each field, or use numbered separators between fields like this?

Professor <1> Thesis needs more work <2> Evolution of Quiescent Galaxies as a Function of Stellar Mass <3> MOIRCS Deep Survey

3) Using a separate dense (or embedding) layer for each column and concatenating the outputs.
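
For reference, approaches 1 and 2 can be sketched in plain Python as below. This is only an illustration: the MARKERS mapping and the helper names join_plain and join_with_markers are made up for this example, and if the strings are fed to a pretrained tokenizer, the marker tokens would typically have to be registered as special tokens so they are not split into pieces.

```python
# Hypothetical marker tokens, one per text column (names chosen for illustration).
MARKERS = {
    "comment_by": "<CB>",
    "comment": "<C>",
    "research_paper": "<R>",
    "books_name": "<B>",
}


def join_plain(row):
    """Approach 1: concatenate the columns into one long string."""
    return " ".join(row[col] for col in MARKERS if row.get(col))


def join_with_markers(row):
    """Approach 2: put each column's marker token in front of its text."""
    return " ".join(
        f"{token} {row[col]}" for col, token in MARKERS.items() if row.get(col)
    )


row = {
    "comment_by": "Professor",
    "comment": "Thesis needs more work",
    "research_paper": "Evolution of Quiescent Galaxies as a Function of Stellar Mass",
    "books_name": "MOIRCS Deep Survey",
}

print(join_plain(row))
print(join_with_markers(row))
```

Empty or missing columns are skipped, so a row without a books_name simply produces a shorter string.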

I've tried all three approaches. Is there any other approach I can try to improve the model's accuracy, or a better way to extract, combine, or join the features?

Thanks in advance!

asked May 23 '20 08:05 by Aaditya Ura



1 Answer

Here are some of the things you could try:

1.) Combine research_paper, books_name and comment into one string.

2.) Treat comment_by as a categorical variable and encode it using a one-hot encoder or a label encoder.

3.) Whatever model you are using, tune the hyperparameters to improve the results.
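
A minimal sketch of points 1 and 2, assuming pandas and scikit-learn are available. The TF-IDF vectorizer and logistic regression are stand-ins picked for illustration, not something the question prescribes, and the all_text column name is made up here. The two dummy rows only show that the pieces fit together; they say nothing about accuracy.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

paper = "Evolution of Quiescent Galaxies as a Function of Stellar Mass"
df = pd.DataFrame({
    "domain": ["Science", "Math"],
    "comment_by": ["Professor", "Professor"],
    "comment": ["Thesis needs more work", "Doesn't follow Latex format"],
    "research_paper": [paper, paper],
    "books_name": ["MOIRCS Deep Survey", "nonlinear dispersive equations"],
})

# Point 1: merge the free-text columns into a single string per row.
text_cols = ["comment", "research_paper", "books_name"]
df["all_text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

# Point 2: one-hot encode comment_by next to the vectorized text.
features = ColumnTransformer([
    # A scalar column name passes a 1-D series, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "all_text"),
    # A list of column names passes a 2-D frame, which OneHotEncoder expects.
    ("who", OneHotEncoder(handle_unknown="ignore"), ["comment_by"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df[["all_text", "comment_by"]]
model.fit(X, df["domain"])
predictions = model.predict(X)
print(predictions)
```

For point 3, the same Pipeline object could be passed to a search utility such as GridSearchCV once there is enough data for cross-validation.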

Do let me know the results you got!

answered Oct 29 '22 07:10 by spectre