I have been using the PyTorch implementation of Google's BERT by HuggingFace for the MADE 1.0 dataset for quite some time now. Up until last time (11-Feb), I had been using the library and getting an F-Score of 0.81 for my Named Entity Recognition task by Fine Tuning the model. But this week when I ran the exact same code which had compiled and run earlier, it threw an error when executing this statement:
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
ValueError: Token indices sequence length is longer than the specified maximum sequence length for this BERT model (632 > 512). Running this sequence through BERT will result in indexing errors
The full code is available in this colab notebook.
To get around this error I modified the above statement to the one below by taking the first 512 tokens of any sequence and made the necessary changes to add the index of [SEP] to the end of the truncated/padded sequence as required by BERT.
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
The result shouldn't have changed because I am only considering the first 512 tokens in the sequence and later truncating to 75 as my (MAX_LEN=75) but my F-Score has dropped to 0.40 and my precision to 0.27 while the Recall remains the same (0.85). I am unable to share the dataset as I have signed a confidentiality clause but I can assure all the preprocessing as required by BERT has been done and all extended tokens like (Johanson --> Johan ##son) have been tagged with X and replaced later after the prediction as said in the BERT Paper.
Has anyone else faced a similar issue or can elaborate on what might be the issue or what changes the PyTorch (Huggingface) people have done on their end recently?
In BERT, the id 101 is reserved for the special [CLS] token, the id 102 is reserved for the special [SEP] token, and the id 0 is reserved for [PAD] token. token_type_ids : To identify the sequence in which a token belongs to. Since we only have one sequence per text, then all the values of token_type_ids will be 0.
mask_token is the masking token [MASK] . no_mask_tokens is a list of tokens that should not be masked. This is useful if we are training the MLM with another task like classification at the same time, and we have tokens such as [CLS] that shouldn't be masked.
bert-base-ner is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PER) and Miscellaneous (MISC).
I've found a fix to get around this. Running the same code with pytorch-pretrained-bert==0.4.0 solves the issue and the performance is restored to normal. There's something messing with the model performance in BERT Tokenizer or BERTForTokenClassification in the new update which is affecting the model performance. Hoping that HuggingFace clears this up soon. :)
pytorch-pretrained-bert==0.4.0, Test F1-Score: 0.82
pytorch-pretrained-bert==0.6.1, Test F1-Score: 0.41
Thanks.
I think you should use batch_encode_plus
and mask output as well as the encoding.
Please see batch_encode_plus in https://huggingface.co/transformers/main_classes/tokenizer.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With