I plan to train an ELMo or BERT model from scratch on the data I have on hand (notes typed by different people). The notes have problems with spelling, formatting, and inconsistent sentences. After reading the ELMo and BERT papers, I know that both models were trained on large corpora of well-formed sentences, such as Wikipedia. I haven't been able to find any processed training samples or any preprocessing tutorial for ELMo or BERT. My question is: does messy text like this need preprocessing before training these models, and if so, how should it be done?
Preprocessing is not needed when using pre-trained language representation models like BERT. In particular, BERT uses all of the information in a sentence, including punctuation and stop words, and views it from a wide range of perspectives by leveraging a multi-head self-attention mechanism. Hope it helps.
BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model. BERT is fully bidirectional, GPT is unidirectional, and ELMo is only shallowly bidirectional. GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).
Truly bidirectional: BERT is deeply bidirectional thanks to its novel masked language modeling technique. ELMo, on the other hand, uses a concatenation of right-to-left and left-to-right LSTMs, and ULMFiT uses a unidirectional LSTM. Having bidirectional context should, in theory, produce more accurate word representations.
BERT uses WordPiece embeddings, which somewhat help with dirty data. https://github.com/google/sentencepiece
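As a minimal sketch of why subword tokenization helps, here is how you might train a SentencePiece model on your own notes and tokenize a misspelled string with it. The file names notes.txt and notes_sp, the vocabulary size, and the sample sentence are assumptions for this example, not anything from the BERT setup itself:

```python
import sentencepiece as spm

# Train a subword model directly on the raw notes (one sentence per line).
# 'notes.txt' and the 8000-token vocabulary are placeholder choices.
spm.SentencePieceTrainer.Train(
    "--input=notes.txt --model_prefix=notes_sp --vocab_size=8000"
)

sp = spm.SentencePieceProcessor()
sp.Load("notes_sp.model")

# Misspelled or unusual words are broken into known subword pieces
# instead of being mapped to a single unknown token.
print(sp.EncodeAsPieces("pateint complaned of hedache"))
```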
Google Research also provides data preprocessing in their code: https://github.com/google-research/bert/blob/master/tokenization.py
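A rough sketch of using that file, assuming you have copied tokenization.py from the repository onto your path and have a vocab.txt from a pretrained BERT checkpoint; the sample text is made up. The tokenizer there already handles lower-casing, punctuation splitting, and WordPiece segmentation:

```python
# Assumes tokenization.py from google-research/bert is importable
# and vocab.txt comes from a pretrained BERT checkpoint.
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Pt. reports mild headache,no fever")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # punctuation split off, unknown words broken into ##-pieces
print(ids)
```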
The default ELMo implementation takes tokens as input (if you provide an untokenized string, it will split it on spaces). Thus spelling correction, deduplication, lemmatization (e.g. as in spaCy https://spacy.io/api/lemmatizer), separating tokens from punctuation, and other standard preprocessing methods may help.
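A minimal sketch of that kind of cleanup with spaCy, assuming the en_core_web_sm model is installed (python -m spacy download en_core_web_sm); the example sentence is invented:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The pateint's notes were typed quickly,with inconsistent spacing.")

# Separate punctuation from words and lemmatize the remaining tokens.
tokens = [tok.lemma_.lower() for tok in doc if not tok.is_punct and not tok.is_space]
print(tokens)
```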
You can check standard ways to preprocess text in the NLTK package, https://www.nltk.org/api/nltk.tokenize.html (for example, the Twitter tokenizer). Beware that NLTK is slow by itself. Many machine learning libraries also provide basic preprocessing (https://github.com/facebookresearch/pytext, https://keras.io/preprocessing/text/).
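For example, NLTK's TweetTokenizer is fairly robust to informal text of the kind you describe. A minimal sketch, with a made-up sample sentence:

```python
from nltk.tokenize import TweetTokenizer

# reduce_len collapses long character repetitions ("soooooo" -> "sooo"),
# preserve_case=False lower-cases everything.
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)

print(tokenizer.tokenize("Pt was feeling muuuuuch better today!!! follow-up @ 3pm"))
```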
You may also experiment with feeding BPE encodings or character n-grams to the input.
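A character n-gram representation is easy to sketch in plain Python; something like the following (the boundary marker and n=3 are arbitrary choices) makes misspelled variants of a word share most of their features:

```python
def char_ngrams(token, n=3, boundary="#"):
    """Return overlapping character n-grams of a token, with boundary markers."""
    padded = boundary + token + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Misspelled variants still share most of their n-grams.
print(char_ngrams("headache"))  # ['#he', 'hea', 'ead', ..., 'he#']
print(char_ngrams("hedache"))
```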
It also depends on the amount of data you have: the more data, the smaller the benefit of preprocessing (in my opinion). Given that you want to train ELMo or BERT from scratch, you should have a lot of data.