How to use multiple text features for NLP classifier?

Tags:

machine-learning

neural-network

deep-learning

nlp

keras

I am trying to build a text classifier. Usually there is one text column and a ground-truth label, but I am working on a problem where the dataset contains several text features, and I am exploring different ways to make use of them.

For example, my dataset looks like this:

Index_no  domain   comment_by  comment                      research_paper                                                 books_name
01        Science  Professor   Thesis needs more work       Evolution of Quiescent Galaxies as a Function of Stellar Mass  MOIRCS Deep Survey
02        Math     Professor   Doesn't follow Latex format  Evolution of Quiescent Galaxies as a Function of Stellar Mass  nonlinear dispersive equations

This is just a dummy dataset. My ground truth (Y) is domain, and the features are comment_by, comment, research_paper, and books_name.

NLP models (RNN/LSTM, Transformers, etc.) usually take a single 3-dimensional input tensor. That works when I have one text column, but how do I feed multiple text features into a text classifier?

What I've tried:

1) Joining all columns into one long string:

Professor Thesis needs more work Evolution of Quiescent Galaxies as a Function of Stellar Mass MOIRCS Deep Survey

2) Using a marker token between columns:

<CB> Professor <C> Thesis needs more work <R> Evolution of Quiescent Galaxies as a Function of Stellar Mass <B> MOIRCS Deep Survey

where <CB> marks comment_by, <C> comment, <R> research_paper, and <B> books_name.

Should I put a marker at the beginning of each field, as above, or only between fields, like this?

Professor <1> Thesis needs more work <2> Evolution of Quiescent Galaxies as a Function of Stellar Mass <3> MOIRCS Deep Survey

3) Using a separate dense layer (or embedding) for each column and concatenating the outputs.
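
For reference, approach 2 could be sketched roughly as follows. This is only a minimal illustration: it assumes each row is available as a dict keyed by column name, and the helper name join_with_markers and the MARKERS mapping are made up for this example.

```python
# Hypothetical mapping from column name to its marker token (illustrative only).
MARKERS = {
    "comment_by": "<CB>",
    "comment": "<C>",
    "research_paper": "<R>",
    "books_name": "<B>",
}


def join_with_markers(row, markers=MARKERS):
    """Concatenate the text columns of one row, each preceded by its marker."""
    parts = []
    for column, marker in markers.items():
        # Treat missing or None values as empty text.
        value = str(row.get(column) or "").strip()
        if value:  # skip empty columns entirely
            parts.append(f"{marker} {value}")
    return " ".join(parts)


# First row of the dummy dataset above.
row = {
    "comment_by": "Professor",
    "comment": "Thesis needs more work",
    "research_paper": "Evolution of Quiescent Galaxies as a Function of Stellar Mass",
    "books_name": "MOIRCS Deep Survey",
}
text = join_with_markers(row)
print(text)
```

If the joined string is then passed through a pretrained subword tokenizer, the markers would presumably need to be registered as special tokens so that they are not split into pieces.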

I've tried all three approaches. Is there any other approach I can try to improve the model's accuracy, or a better way to extract, combine, or join the features?

Thanks in advance!

asked May 23 '20 08:05

Aaditya Ura

1 Answer

Here are some of the things you could try:

1.) Combine research_paper, books_name and comment into one string.

2.) Treat comment_by as a categorical variable and encode it with a one-hot encoder or a label encoder.

3.) Whatever model you are using, tune the hyperparameters to improve the results.
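
As a rough sketch of how these three suggestions might fit together, here is one possible setup with scikit-learn. The TF-IDF features and logistic regression are stand-ins chosen for brevity, not a recommendation over an RNN or Transformer; rows 3 and 4 of the toy frame and the parameter grid are made up purely so the example runs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data: the first two rows mirror the question, the last two are invented.
df = pd.DataFrame({
    "domain": ["Science", "Math", "Science", "Math"],
    "comment_by": ["Professor", "Professor", "Student", "Student"],
    "comment": ["Thesis needs more work", "Doesn't follow Latex format",
                "Add more observations", "Proof is incomplete"],
    "research_paper": ["Evolution of Quiescent Galaxies as a Function of Stellar Mass"] * 4,
    "books_name": ["MOIRCS Deep Survey", "nonlinear dispersive equations",
                   "MOIRCS Deep Survey", "nonlinear dispersive equations"],
})

# 1.) Combine the free-text columns into a single string per row.
text_columns = ["comment", "research_paper", "books_name"]
df["text"] = df[text_columns].fillna("").agg(" ".join, axis=1)

# 2.) Encode comment_by as a categorical variable, next to the text features.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),  # a single column name, not a list
    ("who", OneHotEncoder(handle_unknown="ignore"), ["comment_by"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 3.) Tune hyperparameters; this grid is an arbitrary illustration.
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(df[["text", "comment_by"]], df["domain"])

predictions = search.predict(df[["text", "comment_by"]])
print(search.best_params_)
```

The same split (one combined text input plus a categorical input) should carry over to a neural model, with the one-hot vector concatenated to the text encoder's output before the final dense layers.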

Do let me know the results you got!

answered Oct 29 '22 07:10

spectre
