 

Data Preprocessing for NLP Pre-training Models (e.g. ELMo, Bert)

I plan to train an ELMo or BERT model from scratch on data I have on hand (notes typed by people). The notes were written by many different people, so there are problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models are trained on large amounts of well-formed sentences, e.g. from Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for the ELMo or BERT models. My questions are:

  • Do the BERT and ELMo models have standard data preprocessing steps or standard processed data formats?
  • Based on my existing dirty data, is there any way to preprocess it so that the resulting word representations are more accurate?
asked Mar 01 '19 by Xin


People also ask

What preprocessing is required for BERT?

Preprocessing is not needed when using pre-trained language representation models like BERT. In particular, BERT uses all of the information in a sentence, even punctuation and stop-words, from a wide range of perspectives by leveraging a multi-head self-attention mechanism.

What is BERT and ELMo?

BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model. BERT is purely bidirectional, GPT is unidirectional, and ELMo is semi-bidirectional. GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

Which is better ELMo or BERT?

BERT is deeply bidirectional due to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, generate more accurate word representations.


1 Answer

BERT uses WordPiece embeddings, which helps somewhat with dirty data. https://github.com/google/sentencepiece
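As a rough illustration (not from the original answer), here is a minimal SentencePiece sketch for training a subword model on your own notes; the file names, vocabulary size, and model type are placeholder assumptions:

```python
# Minimal sketch: train a SentencePiece subword model on your own notes and
# tokenize text into subword pieces. File names and vocab_size are
# placeholder assumptions, not values from the answer above.
import sentencepiece as spm

# Train on a plain-text file with one note per line (hypothetical path).
spm.SentencePieceTrainer.train(
    input="notes.txt",
    model_prefix="notes_sp",   # writes notes_sp.model and notes_sp.vocab
    vocab_size=8000,
    model_type="unigram",      # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="notes_sp.model")

# Misspelled or unseen words fall back to smaller subword pieces
# instead of becoming a single unknown token.
print(sp.encode("preprocesing notes typd by people", out_type=str))
```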

Google Research also provides the data preprocessing used for BERT in their code: https://github.com/google-research/bert/blob/master/tokenization.py
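A hedged sketch of how that tokenization module might be used, assuming the repo is on your PYTHONPATH (it also needs TensorFlow 1.x installed) and you have downloaded a BERT checkpoint; the vocab path below is hypothetical:

```python
# Rough sketch of using the tokenization module from the linked
# google-research/bert repo. The vocab path is a placeholder.
import tokenization  # from https://github.com/google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="uncased_L-12_H-768_A-12/vocab.txt",  # hypothetical checkpoint dir
    do_lower_case=True,
)

# FullTokenizer does basic cleanup (whitespace, accents, lower-casing) and
# then WordPiece splitting, so misspellings break into known subword pieces.
tokens = tokenizer.tokenize("Ther are some mispelled words here.")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
```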

The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). So spelling correction, deduplication, lemmatization (e.g. with spaCy, https://spacy.io/api/lemmatizer), separating punctuation from tokens, and other standard preprocessing methods may help.
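For instance, a minimal spaCy sketch of the tokenization, punctuation separation, and lemmatization steps (assuming the small English model `en_core_web_sm` is installed via `python -m spacy download en_core_web_sm`):

```python
# Minimal spaCy sketch for the cleanup steps mentioned above: tokenization,
# separating punctuation into its own tokens, and lemmatization.
import spacy

nlp = spacy.load("en_core_web_sm")

def clean_note(text):
    doc = nlp(text)
    # Keep lower-cased lemmas and drop stray whitespace tokens;
    # punctuation comes out as separate tokens.
    return [tok.lemma_.lower() for tok in doc if not tok.is_space]

print(clean_note("The notes were typed quickly,with odd   spacing."))
```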

You may check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer). Beware that NLTK is slow by itself. Many machine learning libraries also provide their own basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
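A small example of those utilities, assuming NLTK and Keras are installed (the sample strings are made up; on newer Keras versions the import may live under `tensorflow.keras.preprocessing.text` instead):

```python
# Hedged example of the "standard" preprocessing utilities mentioned above:
# NLTK's TweetTokenizer and Keras' basic text preprocessing helper.
from nltk.tokenize import TweetTokenizer
from keras.preprocessing.text import text_to_word_sequence

# TweetTokenizer handles noisy, informal text: lower-casing and
# shortening elongated words ("soooo" -> "sooo").
tweet_tok = TweetTokenizer(preserve_case=False, reduce_len=True)
print(tweet_tok.tokenize("soooo many typos!!! :) see https://example.com"))

# Keras lower-cases, strips punctuation, and splits on whitespace.
print(text_to_word_sequence("Soooo many typos!!! See https://example.com"))
```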

You may also experiment with providing BPE encodings or character n-grams as the input.
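A toy sketch of character n-grams (the helper below is purely illustrative and not part of any of the libraries above):

```python
# Illustrative sketch of feeding character n-grams instead of raw words,
# which can make representations more robust to typos.
def char_ngrams(token, n=3):
    """Return character n-grams of a token, padded with boundary markers."""
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("preprocesing"))   # the typo still shares most n-grams
print(char_ngrams("preprocessing"))  # with the correct spelling
```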

It also depends on the amount of data that you have: the more data you have, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.

answered Sep 19 '22 by Denis Gordeev