What's the best way to use the words themselves as features in a machine learning algorithm? My problem is that I have to extract word-related features from a particular paragraph. Should I use the index in the dictionary as the numerical feature? If so, how would I normalize these? In general, how are words themselves used as features in NLP?



How to include words as numerical feature in classification

2 Answers

There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data vectors) for input to machine learning models:

a Boolean field that encodes the presence or absence of a particular word in a given document;
a frequency histogram of a predetermined set of words, often the X most frequently occurring words across all documents in the training data (more about this one in the last paragraph of this answer);
the juxtaposition of two or more words (e.g., 'alternative' and 'lifestyle' in consecutive order have a meaning not related to either component word); this juxtaposition can either be captured in the data model itself, e.g., a Boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or be exploited in the ML technique, as a naive Bayesian classifier would do in this instance;
words as raw data from which to extract latent features, e.g., LSA or Latent Semantic Analysis (also sometimes called LSI, for Latent Semantic Indexing). LSA is a matrix decomposition-based technique that derives latent variables from the text that are not apparent from the words of the text itself.
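
The first three techniques in the list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three example sentences, the tokenizer, and the helper names (`tokenize`, `build_vocabulary`, `presence_features`, `count_features`, `bigram_feature`) are all made up for this sketch and are not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, top_x):
    """Return the top_x most frequent words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_x]]

def presence_features(doc, vocabulary):
    """Boolean features: 1 if the word occurs in the document, else 0."""
    tokens = set(tokenize(doc))
    return [1 if word in tokens else 0 for word in vocabulary]

def count_features(doc, vocabulary):
    """Frequency histogram: how many times each vocabulary word occurs."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

def bigram_feature(doc, first, second):
    """Boolean feature: 1 if `first` is immediately followed by `second`."""
    tokens = tokenize(doc)
    return int(any(a == first and b == second for a, b in zip(tokens, tokens[1:])))

# Hypothetical toy corpus, invented for this example.
documents = [
    "An alternative lifestyle is not for everyone",
    "The lifestyle section offered an alternative view",
    "The view from the hill",
]

vocabulary = build_vocabulary(documents, top_x=4)
presence_matrix = [presence_features(doc, vocabulary) for doc in documents]
count_matrix = [count_features(doc, vocabulary) for doc in documents]
bigram_column = [bigram_feature(doc, "alternative", "lifestyle") for doc in documents]

print(vocabulary)
print(presence_matrix)
print(count_matrix)
print(bigram_column)
```

Each row of `presence_matrix` or `count_matrix` is one document's feature vector, and `bigram_column` could be appended to it as one extra column. Only the first document contains 'alternative' directly followed by 'lifestyle', even though the second document contains both words.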

A common reference data set in machine learning consists of the frequencies of 50 or so of the most common words, a.k.a. "stop words" (e.g., a, an, of, and, the, there, if), in published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML data repositories, and academic papers presenting classification results on it are likewise common.
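
As a rough sketch of how such stop-word features might be computed, the snippet below turns a text into relative frequencies of the seven example stop words listed above. It is not the reference data set itself: the sample sentence and the function name `stop_word_profile` are assumptions made for illustration, and dividing by the document length is one possible normalization, not necessarily the one used in that data set.

```python
import re
from collections import Counter

# The example stop words mentioned in the answer; a real feature set would use ~50.
STOP_WORDS = ["a", "an", "of", "and", "the", "there", "if"]

def stop_word_profile(text, stop_words=STOP_WORDS):
    """Return the relative frequency of each stop word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        # Avoid dividing by zero for empty input.
        return [0.0] * len(stop_words)
    counts = Counter(tokens)
    total = len(tokens)
    # Dividing by the token count makes long and short texts comparable.
    return [counts[word] / total for word in stop_words]

# Hypothetical sample text, invented for this example.
sample = "The cat and the dog sat if there was a mat"
profile = stop_word_profile(sample)

for word, freq in zip(STOP_WORDS, profile):
    print(f"{word:>6}: {freq:.3f}")
```

One such vector per text (or per chapter) gives the rows of the data matrix, with the author as the class label.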

answered Nov 15 '22 09:11

doug

The standard approach is the "bag-of-words" representation, where you have one feature per word, with value "1" if the word occurs in the document and "0" if it doesn't.

This gives lots of features, but if you have a simple learner like Naive Bayes, that's still OK.
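
To make the pairing of binary bag-of-words features with Naive Bayes concrete, here is a minimal Bernoulli Naive Bayes sketch in plain Python. The four training rows, the word columns, the labels, and the function names `train_bernoulli_nb` and `predict` are all invented for this illustration; add-one (Laplace) smoothing is an assumption of the sketch, chosen so that no probability is ever exactly 0 or 1.

```python
import math

def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model on rows of 0/1 word-presence features."""
    n_features = len(rows[0])
    model = {}
    for label in sorted(set(labels)):
        class_rows = [row for row, y in zip(rows, labels) if y == label]
        n = len(class_rows)
        # Smoothed probability that each word is present in a document of this class.
        p_present = [
            (sum(row[j] for row in class_rows) + alpha) / (n + 2 * alpha)
            for j in range(n_features)
        ]
        log_prior = math.log(n / len(rows))
        model[label] = (log_prior, p_present)
    return model

def predict(model, row):
    """Return the label with the highest log-posterior for a 0/1 feature row."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, p_present) in model.items():
        score = log_prior
        for x, p in zip(row, p_present):
            # Present words contribute log p, absent words contribute log (1 - p).
            score += math.log(p if x else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical data. Columns: presence of "goal", "match", "election", "vote".
rows = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
labels = ["sports", "sports", "politics", "politics"]

model = train_bernoulli_nb(rows, labels)
print(predict(model, [1, 1, 0, 0]))
print(predict(model, [0, 0, 1, 0]))
```

Training cost grows only linearly with the number of word features, which is why a large vocabulary is not a problem for this kind of learner.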

"Index in the dictionary" is a useless feature, I wouldn't use it.

answered Nov 15 '22 09:11

Yaroslav Bulatov


Tags:

machine-learning

classification

nlp

document-classification

AlgoMan
