Should we lowercase the input data for (pre-)training a BERT uncased model using Hugging Face? I looked at this response from Thomas Wolf (https://github.com/huggingface/transformers/issues/92#issuecomment-444677920) but I am not entirely sure whether that is what he meant.
What happens if we lowercase the text?
In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased the text is kept exactly as it is (no changes). For example, the input "OpenGenus" is converted to "opengenus" for BERT uncased, while BERT cased takes in "OpenGenus" unchanged.
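A quick way to see the difference is to run the same string through both tokenizers (a small sketch; the exact subword pieces depend on each model's vocabulary):

from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained('bert-base-uncased')
cased = BertTokenizer.from_pretrained('bert-base-cased')
# The uncased tokenizer lowercases first, so every WordPiece it produces is lowercase.
print(uncased.tokenize('OpenGenus'))
# The cased tokenizer leaves the capitals in place, so the pieces keep the original casing.
print(cased.tokenize('OpenGenus'))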
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%.
You can use the tokenizer in the same way for all of the BERT variants that Hugging Face provides; just load it from the checkpoint that matches your model. Because BERT can accept at most 512 tokens at a time, we must set the truncation parameter to True. The add_special_tokens parameter tells the tokenizer to add the special tokens BERT expects, i.e. [CLS] at the start and [SEP] at the end of each sequence.
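For example, a minimal sketch (the repeated long_text string is made up here, and the truncation argument assumes a reasonably recent transformers release):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# A deliberately long input that would exceed BERT's 512-token limit.
long_text = 'this is a cat. ' * 400
# truncation=True caps the sequence at max_length; add_special_tokens=True (the default) wraps it in [CLS] ... [SEP].
ids = tokenizer.encode(long_text, add_special_tokens=True, max_length=512, truncation=True)
print(len(ids))          # 512
print(ids[0], ids[-1])   # 101 ([CLS]) and 102 ([SEP])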
The input consists of a pair of sentences, packed into one sequence, and two special tokens: [CLS] and [SEP]. The WordPiece tokenization used in BERT can break a word like "playing" into "play" and "##ing".
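Both points are easy to check by encoding a short sentence and converting the IDs back to tokens (a sketch; the exact splits depend on the WordPiece vocabulary of the checkpoint you load):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
ids = tokenizer.encode('he likes playing snowboarding', add_special_tokens=True)
# Something like ['[CLS]', 'he', 'likes', ..., '[SEP]'], with words missing from
# the vocabulary split into sub-pieces prefixed by '##'.
print(tokenizer.convert_ids_to_tokens(ids))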
For this purpose, I will be using BERT as the reference model. To perform pre-training, the data must be in a specific format: a plain text file (.txt) with one sentence per line. This file is first tokenized with the WordPiece tokenizer, and the tokenized data is then used for pre-training.
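As an illustration of that format (a sketch; the file name corpus.txt and the example sentences are made up), the corpus can be written with one sentence per line and then tokenized line by line:

from transformers import BertTokenizer

# One sentence per line, as the pre-training scripts expect.
sentences = [
    'the quick brown fox jumps over the lazy dog',
    'bert is pre-trained with a masked language modelling objective',
    'each line of the file holds exactly one sentence',
]
with open('corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sentences) + '\n')

# The WordPiece tokenizer is then applied to every line before pre-training.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        print(tokenizer.tokenize(line.strip()))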
What is BERT? BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model by Google. It uses two steps, pre-training and fine-tuning, to create state-of-the-art models for a wide range of tasks.
The input representation for BERT (see the figure in the paper). The model needs to be able to represent both a single sentence and a pair of sentences packed together unambiguously in one token sequence. The authors note that a "sentence" can be an arbitrary span of contiguous text rather than an actual linguistic sentence.
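A sketch of how the packing looks in practice (assuming encode_plus from the transformers tokenizer API; the example sentences are made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# A single "sentence": [CLS] ... [SEP], with all token_type_ids equal to 0.
single = tokenizer.encode_plus('my dog is cute')
print(tokenizer.convert_ids_to_tokens(single['input_ids']))
print(single['token_type_ids'])

# A sentence pair: [CLS] A [SEP] B [SEP]; token_type_ids switch to 1 for the second span,
# which is what packs the two "sentences" unambiguously into one token sequence.
pair = tokenizer.encode_plus('my dog is cute', 'he likes walks')
print(tokenizer.convert_ids_to_tokens(pair['input_ids']))
print(pair['token_type_ids'])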
The output of the BERT model contains a vector of size hidden_size for each input position, and the first position in the output corresponds to the [CLS] token. This output can then be used as the input to a classifier neural network, for example for toxicity classification.
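A minimal sketch of that wiring with a plain BertModel (the linear classifier head and its two output classes are assumptions for illustration, and the attribute names assume a recent transformers release):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Hypothetical classifier head: maps the hidden_size-dim [CLS] vector to 2 classes.
classifier = torch.nn.Linear(model.config.hidden_size, 2)

inputs = tokenizer('you are a wonderful person', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 is the [CLS] token, whose vector feeds the classifier.
cls_vector = outputs.last_hidden_state[:, 0, :]
logits = classifier(cls_vector)
print(logits.shape)   # torch.Size([1, 2])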
The tokenizer will take care of the lowercasing for you.
A simple example:
import torch
from transformers import BertTokenizer

# Uncased tokenizer: the text is lowercased before WordPiece tokenization.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', padding_side='right')
# padding='max_length' pads with 0 up to max_length; add_special_tokens adds [CLS] (101) and [SEP] (102).
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
# The same sentence with different casing.
input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
Out:
tensor([[ 101, 2023, 2003, 1037, 4937, 102, 0, 0, 0, 0]])
tensor([[ 101, 2023, 2003, 1037, 4937, 102, 0, 0, 0, 0]])
But in the case of the cased model:
# Cased tokenizer: the original casing is preserved.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', padding_side='right')
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)
Out:
tensor([[ 101, 1142, 1110, 170, 5855, 102, 0, 0, 0, 0]])
tensor([[ 101, 1188, 1110, 170, 8572, 102, 0, 0, 0, 0]])
So the uncased tokenizer maps "this is a cat" and "This is a Cat" to identical token IDs, while the cased tokenizer produces different IDs for "this"/"This" and "cat"/"Cat". In other words, there is no need to lowercase the text yourself before using the uncased model.