Should we lowercase the input data for (pre-)training a BERT uncased model using Hugging Face? I looked at this response from Thomas Wolf (https://github.com/huggingface/transformers/issues/92#issuecomment-444677920) but I am not entirely sure whether that is what he meant.
What happens if we lowercase the text?
In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased the text is kept exactly as it is (no changes). For example, the input "OpenGenus" is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" unchanged.
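A quick way to see the difference is to run the same string through both tokenizers (a small sketch; the exact subword pieces depend on each model's vocabulary):

from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained('bert-base-uncased')
cased = BertTokenizer.from_pretrained('bert-base-cased')
# The uncased tokenizer lowercases first, so every WordPiece it produces is lowercase.
print(uncased.tokenize('OpenGenus'))
# The cased tokenizer leaves the capitals in place, so the pieces keep the original casing.
print(cased.tokenize('OpenGenus'))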
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.
You can use the tokenizer in the same way for all of the BERT variants that Hugging Face provides; just load it from the checkpoint that matches your model. Because BERT can accept at most 512 tokens at a time, we must set the truncation parameter to True. The add_special_tokens parameter tells the tokenizer to add the special tokens BERT expects, i.e. [CLS] at the start and [SEP] at the end of each sequence.
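For example, a minimal sketch (the repeated long_text string is made up here, and the truncation argument assumes a reasonably recent transformers release):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A deliberately long input that would exceed BERT's 512-token limit.
long_text = 'this is a cat. ' * 400
# truncation=True caps the sequence at max_length; add_special_tokens=True (the default) wraps it in [CLS] ... [SEP].
ids = tokenizer.encode(long_text, add_special_tokens=True, max_length=512, truncation=True)
print(len(ids))          # 512
print(ids[0], ids[-1])   # 101 ([CLS]) and 102 ([SEP])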
The input consists of a pair of sentences, packed into one sequence, and two special tokens: [CLS] and [SEP]. The WordPiece tokenization used in BERT can break a word like "playing" into "play" and "##ing".
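Both points are easy to check by encoding a short sentence and converting the IDs back to tokens (a sketch; the exact splits depend on the WordPiece vocabulary of the checkpoint you load):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids = tokenizer.encode('he likes playing snowboarding', add_special_tokens=True)
# Something like ['[CLS]', 'he', 'likes', ..., '[SEP]'], with words missing from
# the vocabulary split into sub-pieces prefixed by '##'.
print(tokenizer.convert_ids_to_tokens(ids))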
For this purpose, I will be using BERT as the reference model. To perform pre-training, the data must be in a specific format: a plain text file (.txt) with one sentence per line. This file is first tokenized with the WordPiece tokenizer, and the tokenized data is then used for pre-training.
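As an illustration of that format (a sketch; the file name corpus.txt and the example sentences are made up), the corpus can be written with one sentence per line and then tokenized line by line:

from transformers import BertTokenizer

# One sentence per line, as the pre-training scripts expect.
sentences = [
    'the quick brown fox jumps over the lazy dog',
    'bert is pre-trained with a masked language modelling objective',
    'each line of the file holds exactly one sentence',
]
with open('corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sentences) + '\n')

# The WordPiece tokenizer is then applied to every line before pre-training.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        print(tokenizer.tokenize(line.strip()))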
What is BERT? BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model by Google. It uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks.
The input representation for BERT (see the figure in the paper). The model needs to be able to represent both a single sentence and a pair of sentences packed together unambiguously in one token sequence. The authors note that a "sentence" can be an arbitrary span of contiguous text rather than an actual linguistic sentence.
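A sketch of how the packing looks in practice (assuming encode_plus from the transformers tokenizer API; the example sentences are made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A single "sentence": [CLS] ... [SEP], with all token_type_ids equal to 0.
single = tokenizer.encode_plus('my dog is cute')
print(tokenizer.convert_ids_to_tokens(single['input_ids']))
print(single['token_type_ids'])

# A sentence pair: [CLS] A [SEP] B [SEP]; token_type_ids switch to 1 for the second span,
# which is what packs the two "sentences" unambiguously into one token sequence.
pair = tokenizer.encode_plus('my dog is cute', 'he likes walks')
print(tokenizer.convert_ids_to_tokens(pair['input_ids']))
print(pair['token_type_ids'])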
The output of the BERT model contains a vector of size hidden_size for each input position, and the first position in the output corresponds to the [CLS] token. This output can then be used as the input to a classifier neural network, for example for toxicity classification.
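A minimal sketch of that wiring with a plain BertModel (the linear classifier head and its two output classes are assumptions for illustration, and the attribute names assume a recent transformers release):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Hypothetical classifier head: maps the hidden_size-dim [CLS] vector to 2 classes.
classifier = torch.nn.Linear(model.config.hidden_size, 2)

inputs = tokenizer('you are a wonderful person', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 is the [CLS] token, whose vector feeds the classifier.
cls_vector = outputs.last_hidden_state[:, 0, :]
logits = classifier(cls_vector)
print(logits.shape)   # torch.Size([1, 2])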
The tokenizer will take care of the lowercasing for you.
A simple example:
import torch
from transformers import BertTokenizer

# Uncased tokenizer: the text is lowercased before WordPiece tokenization.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', padding_side='right')
# padding='max_length' pads with 0 up to max_length; add_special_tokens adds [CLS] (101) and [SEP] (102).
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
# The same sentence with different casing.
input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
Out:
tensor([[ 101, 2023, 2003, 1037, 4937, 102, 0, 0, 0, 0]])
tensor([[ 101, 2023, 2003, 1037, 4937, 102, 0, 0, 0, 0]])
But in the case of the cased model:
# Cased tokenizer: the original casing is preserved.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', padding_side='right')
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
Out:
tensor([[ 101, 1142, 1110, 170, 5855, 102, 0, 0, 0, 0]])
tensor([[ 101, 1188, 1110, 170, 8572, 102, 0, 0, 0, 0]])
So the uncased tokenizer maps "this is a cat" and "This is a Cat" to identical token IDs, while the cased tokenizer produces different IDs for "this"/"This" and "cat"/"Cat". In other words, there is no need to lowercase the text yourself before using the uncased model.