 

Should we lowercase input data for (pre)training a BERT uncased model using Hugging Face?

Should we lowercase the input data when (pre)training a BERT uncased model with Hugging Face? I looked at this response from Thomas Wolf (https://github.com/huggingface/transformers/issues/92#issuecomment-444677920) but I'm not entirely sure that's what he meant.

What happens if we lowercase the text?

Asked Jun 19 '20 by CARTman

People also ask

What is cased and uncased in BERT?

In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased the text is left exactly as the input text (no changes). For example, the input "OpenGenus" is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" unchanged.
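As a quick check (a minimal sketch, assuming the transformers library is installed; the exact subword splits depend on each model's vocabulary), you can inspect the lowercasing flag and the tokenized output of both variants:

from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained('bert-base-uncased')
cased = BertTokenizer.from_pretrained('bert-base-cased')

# The uncased tokenizer lowercases text before WordPiece splitting;
# the cased tokenizer leaves the casing untouched.
print(uncased.basic_tokenizer.do_lower_case)  # True
print(cased.basic_tokenizer.do_lower_case)    # False

# "OpenGenus" is lowercased (and possibly split) by the uncased tokenizer,
# while the cased tokenizer keeps the capital letters.
print(uncased.tokenize("OpenGenus"))
print(cased.tokenize("OpenGenus"))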

What is BERT base uncased trained on?

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.

How do you use the Hugging Face BERT model?

You can use the same tokenizer for all of the various BERT models that Hugging Face provides. Since BERT can only accept 512 tokens at a time, we must set the truncation parameter to True. The add_special_tokens parameter simply tells the tokenizer to add special tokens such as [CLS] at the start and [SEP] at the end.
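A short sketch of that call, assuming the transformers library (the example sentence is made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoded = tokenizer(
    "A very long document would be cut off after 512 tokens.",
    truncation=True,          # BERT accepts at most 512 tokens, so truncate longer inputs
    max_length=512,
    add_special_tokens=True,  # prepend [CLS] and append [SEP] (this is the default)
)
print(encoded['input_ids'])
print(encoded['attention_mask'])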

What are the inputs to BERT?

The input consists of a pair of sentences, called sequences, and two special tokens: [CLS] and [SEP]. The WordPiece tokenization used in BERT can break a word such as "playing" into "play" and "##ing".
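A rough illustration with bert-base-uncased (the exact subword split depends on the vocabulary shipped with the model):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encoding a sentence pair adds [CLS] at the start and [SEP] after each sentence.
ids = tokenizer.encode("He is playing", "She was reading")
print(tokenizer.convert_ids_to_tokens(ids))

# Words missing from the vocabulary are split into WordPiece subwords marked with '##'.
print(tokenizer.tokenize("unaffable"))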

How to use BERT as a reference model for pre-training?

For this purpose, I will be using BERT as a reference model. To perform pre-training, the data must be in a specific format: a plain text file (.txt) with one sentence per line. This text file is first tokenized with the WordPiece tokenizer, and pre-training is then performed on the resulting data.
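A minimal sketch of that preparation, assuming the transformers LineByLineTextDataset and DataCollatorForLanguageModeling utilities (the file name corpus.txt and the sentences are made up):

from transformers import (BertTokenizer, LineByLineTextDataset,
                          DataCollatorForLanguageModeling)

# One sentence per line, plain .txt file.
sentences = [
    "The first training sentence goes here.",
    "Each sentence sits on its own line.",
]
with open("corpus.txt", "w") as f:
    f.write("\n".join(sentences))

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the text file with the WordPiece tokenizer, one example per line.
dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="corpus.txt", block_size=128)

# Randomly mask 15% of the tokens for the masked-language-modelling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)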

What is a BERT model?

BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model by Google. It uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks.

What is the input representation for BERT?

[Figure: the input representation for BERT. Source: the BERT paper.] The model must be able to take as input both a single sentence and a pair of sentences packed together unambiguously in one token sequence. The authors note that a "sentence" can be an arbitrary span of contiguous text rather than an actual linguistic sentence.
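A short sketch (with bert-base-uncased; the sentences are made up) of how one and two sentences are packed into a single, unambiguous token sequence:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

single = tokenizer("Dogs bark.")
pair = tokenizer("Dogs bark.", "Cats meow.")

# token_type_ids are all 0 for a single sentence; for a pair, the tokens of the
# second sentence (after the first [SEP]) get segment id 1.
print(single['token_type_ids'])
print(pair['token_type_ids'])
print(tokenizer.convert_ids_to_tokens(pair['input_ids']))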

What is the output of the BERT model in machine learning?

The output of the BERT model contains a vector of size hidden_size for each input position, and the first position in the output corresponds to the [CLS] token. This output can then be used as the input to our classifier neural network, for example for classifying the toxicity of the text.
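A minimal sketch (assuming PyTorch and transformers) of taking the [CLS] vector from BERT's output as the feature for a downstream classifier:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("You are a nice person", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 is the [CLS] token, often used as a sentence-level representation.
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # torch.Size([1, 768]) for bert-base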


1 Answer

The tokenizer will take care of that: the uncased tokenizer lowercases the text for you.

A simple example:

import torch
from transformers import BertTokenizer

# Uncased tokenizer: lowercases the input before WordPiece tokenization.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True,
                                          max_length=10, padding='max_length',
                                          truncation=True)).unsqueeze(0)
print(input_ids)

# Same sentence with different casing.
input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True,
                                          max_length=10, padding='max_length',
                                          truncation=True)).unsqueeze(0)
print(input_ids)

Out:

tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])
tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])

Both inputs produce identical IDs because the uncased tokenizer lowercases the text internally. But in the case of the cased model:

# Cased tokenizer: keeps the original casing, so the IDs differ.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True,
                                          max_length=10, padding='max_length',
                                          truncation=True)).unsqueeze(0)
print(input_ids)

input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True,
                                          max_length=10, padding='max_length',
                                          truncation=True)).unsqueeze(0)
print(input_ids)

Out:
tensor([[ 101, 1142, 1110,  170, 5855,  102,    0,    0,    0,    0]])
tensor([[ 101, 1188, 1110,  170, 8572,  102,    0,    0,    0,    0]])
Answered Oct 07 '22 by Zabir Al Nazi