How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors, each vector has the same dimension as the cardinality of your vocabulary. To reduce dimensionality, stopwords are usually removed, and stemming, lemmatization, etc. are applied to normalize the features you want to perform some NLP task on.

I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. with word2vec). My goal is to use these word embeddings as features for a NN that classifies texts as topic A or not topic A, and then to perform event extraction on the documents of topic A (using a second NN).

My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs I realize that, applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present (to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
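To make that concern concrete, here is a minimal sketch of how a CBOW-style context window changes once stopwords are stripped. The `context_pairs` helper, the window size and the tiny stopword list are all hypothetical and only for illustration; this is not how word2vec itself is implemented.

```python
def context_pairs(tokens, window=2):
    """Return (context, target) pairs roughly as a CBOW-style model sees them."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


# Hypothetical toy stopword list, for illustration only.
STOPWORDS = {"the", "on", "a", "of"}

sentence = "the cat sat on the mat".split()
reduced = [t for t in sentence if t not in STOPWORDS]

# Look up the context that surrounds the target word "sat" in each version.
full_ctx = {target: ctx for ctx, target in context_pairs(sentence)}["sat"]
reduced_ctx = {target: ctx for ctx, target in context_pairs(reduced)}["sat"]

print(full_ctx)     # ['the', 'cat', 'on', 'the']
print(reduced_ctx)  # ['cat', 'mat']
```

With stopwords removed, "mat" lands inside the window of "sat" even though it was three tokens away in the original sentence, so the model is trained on a different notion of context.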

I've found some guidance online but I'm still curious to know what the community here thinks:

Are there any recent, commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercasing, etc.?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?

My thoughts:

It is better to remove punctuation (but e.g. in Spanish don't remove the accents, because they do convey contextual information), convert written-out numbers to numerals, do not lowercase everything (case is useful for entity extraction), no stemming, no lemmatizing.

Does this sound right?
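For concreteness, the light-touch preprocessing I have in mind might look like the minimal sketch below. The `light_preprocess` function and the tiny `NUMBER_WORDS` mapping are hypothetical names made up for this example; a real pipeline would need a proper tokenizer and number parser.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny mapping of number words to numerals.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def light_preprocess(text):
    """Drop punctuation, keep case and accents, map number words to numerals."""
    # Compose accented characters so that each one is a single code point.
    text = unicodedata.normalize("NFC", text)
    # \w is Unicode-aware for str patterns, so accented letters are kept.
    tokens = re.findall(r"\w+", text)
    # Only number words are replaced; every other token keeps its original case.
    return [NUMBER_WORDS.get(tok.lower(), tok) for tok in tokens]


print(light_preprocess("María bought three apples in Bogotá!"))
# ['María', 'bought', '3', 'apples', 'in', 'Bogotá']
```

Note that there is no stemming, lemmatizing or lowercasing here, so "María" and "Bogotá" stay recognizable as entities.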

asked May 31 '17 18:05

xv70

3 Answers

I've been working on this problem myself for some time. I totally agree with the other answers: it really depends on your problem, and you must match your input to the output that you expect. I found that for certain tasks like sentiment analysis it's OK to remove a lot of nuance by preprocessing, but for others, e.g. text generation, it is quite essential to keep everything.

I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.

I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:

On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis by Jose Camacho-Collados and Mohammad Taher Pilehvar

https://arxiv.org/pdf/1707.01780.pdf

Here is a quote from their conclusion:

"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."

answered Oct 21 '22 20:10

Carsten

So many questions. The answer to all of them is probably "it depends". You need to consider the classes you are trying to predict and the kind of documents you have. Predicting authorship (where you definitely need to keep all kinds of punctuation and case so that stylometry will work) is not the same as sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
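As a rough illustration of the negation point, here is a minimal sketch of stopword removal that can optionally keep negation words. The word lists and the `remove_stopwords` helper are hypothetical and far too small for real use.

```python
# Hypothetical toy word lists, for illustration only.
STOPWORDS = {"the", "is", "a", "this", "not", "no", "and", "was", "it"}
NEGATIONS = {"not", "no", "never"}


def remove_stopwords(tokens, keep_negations=True):
    """Drop stopwords, optionally sparing negation words."""
    drop = STOPWORDS - NEGATIONS if keep_negations else STOPWORDS
    return [t for t in tokens if t.lower() not in drop]


tokens = "this movie is not good".split()

print(remove_stopwords(tokens, keep_negations=False))  # ['movie', 'good']
print(remove_stopwords(tokens))                        # ['movie', 'not', 'good']
```

A generic stopword list that happens to contain "not" flips the apparent polarity of the sentence, which is why sentiment tasks need a more careful list.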

answered Oct 21 '22 20:10

Josep Valls

I would say apply the same preprocessing to both ends. The surface forms are your link, so you can't normalise them in different ways. I do agree with the point Josep Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on a variety of tasks, and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)
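A minimal sketch of why the two ends have to match: if the embedding vocabulary was built from lowercased text but the downstream text keeps its case, more tokens fail the lookup. The function names and the toy sentences below are made up for illustration.

```python
def preprocess_lower(text):
    """Preprocessing assumed to have been used when building the vocabulary."""
    return text.lower().split()


def preprocess_cased(text):
    """A mismatched variant that keeps the original case."""
    return text.split()


def oov_rate(tokens, vocab):
    """Fraction of tokens that have no entry in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(t not in vocab for t in tokens) / len(tokens)


# Toy "embedding vocabulary" built from lowercased training text.
vocab = set(preprocess_lower("Apple opened a store in Paris"))

new_text = "Paris has an Apple store"
matched = oov_rate(preprocess_lower(new_text), vocab)
mismatched = oov_rate(preprocess_cased(new_text), vocab)

print(matched)     # 0.4
print(mismatched)  # 0.8
```

Here "Paris" and "Apple" are in the vocabulary only in lowercase, so the cased pipeline loses them even though the embeddings exist.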

answered Oct 21 '22 22:10

Aleksandar Savkov
