I plan to train an ELMo or BERT model from scratch on the data I have on hand (notes typed by different people). The notes have problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models were trained on large corpora of well-formed sentences, such as Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for ELMo or BERT. My question is: does messy text like this need preprocessing before training these models, and if so, how should it be done?
Preprocessing is not needed when using pre-trained language representation models like BERT. In particular, BERT uses all of the information in a sentence, including punctuation and stop words, and views it from a wide range of perspectives by leveraging a multi-head self-attention mechanism. Hope it helps.
BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model. BERT is fully bidirectional, GPT is unidirectional, and ELMo is only shallowly bidirectional. GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
Truly bidirectional: BERT is deeply bidirectional thanks to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, produce more accurate word representations.
BERT uses WordPiece embeddings, which somewhat help with dirty data. https://github.com/google/sentencepiece
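As a minimal sketch of why subword tokenization helps, here is how you might train a SentencePiece model on your own notes and tokenize a misspelled string with it. The file names notes.txt and notes_sp, the vocabulary size, and the sample sentence are assumptions for this example, not anything from the BERT setup itself:

```python
import sentencepiece as spm

# Train a subword model directly on the raw notes (one sentence per line).
# 'notes.txt' and the 8000-token vocabulary are placeholder choices.
spm.SentencePieceTrainer.Train(
    "--input=notes.txt --model_prefix=notes_sp --vocab_size=8000"
)

sp = spm.SentencePieceProcessor()
sp.Load("notes_sp.model")

# Misspelled or unusual words are broken into known subword pieces
# instead of being mapped to a single unknown token.
print(sp.EncodeAsPieces("pateint complaned of hedache"))
```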
Google Research also provides data preprocessing in their code: https://github.com/google-research/bert/blob/master/tokenization.py
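A rough sketch of using that file, assuming you have copied tokenization.py from the repository onto your path and have a vocab.txt from a pretrained BERT checkpoint; the sample text is made up. The tokenizer there already handles lower-casing, punctuation splitting, and WordPiece segmentation:

```python
# Assumes tokenization.py from google-research/bert is importable
# and vocab.txt comes from a pretrained BERT checkpoint.
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Pt. reports mild headache,no fever")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # punctuation split off, unknown words broken into ##-pieces
print(ids)
```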
The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. as in spaCy https://spacy.io/api/lemmatizer), separating tokens from punctuation, and other standard preprocessing methods may help.
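A minimal sketch of that kind of cleanup with spaCy, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm); the example sentence is invented:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The pateint's notes were typed quickly,with inconsistent spacing.")

# Separate punctuation from words and lemmatize the remaining tokens.
tokens = [tok.lemma_.lower() for tok in doc if not tok.is_punct and not tok.is_space]
print(tokens)
```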
You can check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer). Beware that NLTK is slow by itself. Many machine learning libraries also provide basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
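For example, NLTK's TweetTokenizer is fairly robust to informal text of the kind you describe. A minimal sketch, with a made-up sample sentence:

```python
from nltk.tokenize import TweetTokenizer

# reduce_len collapses long character repetitions ("soooooo" -> "sooo"),
# preserve_case=False lower-cases everything.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)

print(tokenizer.tokenize("Pt was feeling muuuuuch better today!!! follow-up @ 3pm"))
```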
You may also experiment with feeding BPE encodings or character n-grams to the input.
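A character n-gram representation is easy to sketch in plain Python; something like the following (the boundary marker and n=3 are arbitrary choices) makes misspelled variants of a word share most of their features:

```python
def char_ngrams(token, n=3, boundary="#"):
    """Return overlapping character n-grams of a token, with boundary markers."""
    padded = boundary + token + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Misspelled variants still share most of their n-grams.
print(char_ngrams("headache"))  # ['#he', 'hea', 'ead', ..., 'he#']
print(char_ngrams("hedache"))
```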
It also depends on the amount of data you have: the more data, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.