
PyTorch Huggingface BERT-NLP for Named Entity Recognition

I have been using the PyTorch implementation of Google's BERT by HuggingFace with the MADE 1.0 dataset for quite some time now. Up until my last run (11 Feb), I had been using the library and getting an F-score of 0.81 on my Named Entity Recognition task by fine-tuning the model. But this week, when I ran the exact same code that had run without issues earlier, it threw an error when executing this statement:

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

ValueError: Token indices sequence length is longer than the specified maximum sequence length for this BERT model (632 > 512). Running this sequence through BERT will result in indexing errors

The full code is available in this colab notebook.

To get around this error, I modified the statement above to the one below, taking only the first 512 tokens of each sequence, and made the necessary changes to add the id of [SEP] to the end of the truncated/padded sequence, as BERT requires (a minimal sketch of this step follows the statement).

input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt[:512]) for txt in tokenized_texts], maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
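
For clarity, here is a minimal sketch of that change, assuming tokenizer, tokenized_texts and MAX_LEN are the same variables as above; this is not the exact notebook code, just an illustration of "truncate, then make sure the sequence still ends with [SEP]":

from keras.preprocessing.sequence import pad_sequences

sep_id = tokenizer.convert_tokens_to_ids(["[SEP]"])[0]

ids_list = []
for txt in tokenized_texts:
    ids = tokenizer.convert_tokens_to_ids(txt)[:512]   # stay under BERT's 512-token limit
    if len(ids) >= MAX_LEN:                            # pad_sequences will cut again at MAX_LEN,
        ids = ids[:MAX_LEN - 1] + [sep_id]             # so keep [SEP] as the final token
    ids_list.append(ids)

input_ids = pad_sequences(ids_list, maxlen=MAX_LEN, dtype="long",
                          truncating="post", padding="post")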

The result shouldn't have changed, because I am only considering the first 512 tokens of each sequence and later truncating to 75 (MAX_LEN=75), yet my F-score has dropped to 0.40 and my precision to 0.27, while recall remains the same (0.85). I am unable to share the dataset because I have signed a confidentiality clause, but I can assure you that all the preprocessing required by BERT has been done, and all extended WordPiece tokens (e.g. Johanson --> Johan ##son) have been tagged with X and restored after prediction, as described in the BERT paper.
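
For reference, a hedged sketch of the kind of subword/label alignment described here (the function and variable names are illustrative, not from the original code):

def extend_labels(tokens, word_labels):
    # Give the original word label to the first WordPiece and "X" to the "##"
    # continuation pieces, e.g. Johanson -> ["Johan", "##son"] -> ["B-PER", "X"].
    out = []
    label_iter = iter(word_labels)
    for tok in tokens:
        out.append("X" if tok.startswith("##") else next(label_iter))
    return out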

Has anyone else faced a similar issue or can elaborate on what might be the issue or what changes the PyTorch (Huggingface) people have done on their end recently?

asked Feb 25 '19 by Ashwin Ambal

People also ask

How do you use BERT for named entity recognition?

In BERT, the id 101 is reserved for the special [CLS] token, the id 102 for the special [SEP] token, and the id 0 for the [PAD] token. token_type_ids identifies which sequence a token belongs to; since we only have one sequence per text, all the token_type_ids values will be 0.
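
A quick way to check these values yourself (a sketch using a recent version of the transformers library; the model name is only an example):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer("John lives in Berlin")
print(encoding["input_ids"])       # starts with 101 ([CLS]) and ends with 102 ([SEP])
print(encoding["token_type_ids"])  # all 0 for a single sequence
print(tokenizer.pad_token_id)      # 0 ([PAD])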

What is Mask_token?

mask_token is the masking token [MASK]. no_mask_tokens is a list of tokens that should not be masked. This is useful when training the MLM together with another task, such as classification, where tokens like [CLS] should not be masked.
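
A minimal sketch of this kind of masking (the names mask_tokens, mask_id and no_mask_ids are illustrative):

import random

def mask_tokens(input_ids, mask_id, no_mask_ids, prob=0.15):
    # Replace roughly `prob` of the ids with the [MASK] id, skipping ids that
    # must never be masked (e.g. the ids of [CLS] and [SEP]).
    return [
        mask_id if tok not in no_mask_ids and random.random() < prob else tok
        for tok in input_ids
    ]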

What is BERT base NER?

bert-base-ner is a fine-tuned BERT model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance on the NER task. It has been trained to recognize four types of entities: locations (LOC), organizations (ORG), persons (PER) and miscellaneous (MISC).
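
For example, such a checkpoint can be run through the transformers pipeline API (a sketch; the model id dslim/bert-base-NER and the example sentence are assumptions, not from the original post):

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Angela Merkel visited the Google office in Berlin."))
# -> PER, ORG and LOC entities with scores and character offsets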


2 Answers

I've found a fix to get around this. Running the same code with pytorch-pretrained-bert==0.4.0 solves the issue and restores the performance to normal. Something in BertTokenizer or BertForTokenClassification changed in the new release and is hurting model performance. Hoping that HuggingFace clears this up soon. :)

pytorch-pretrained-bert==0.4.0, Test F1-Score: 0.82

pytorch-pretrained-bert==0.6.1, Test F1-Score: 0.41

Thanks.

answered Nov 03 '22 by Ashwin Ambal


I think you should use batch_encode_plus and feed the attention mask it returns to the model along with the encoding.

Please see batch_encode_plus in https://huggingface.co/transformers/main_classes/tokenizer.html
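
A hedged sketch of that suggestion (sentences, MAX_LEN and the model name are placeholders, and the exact argument names vary slightly between transformers versions):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoding = tokenizer.batch_encode_plus(
    sentences,                  # list of raw text strings
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_attention_mask=True,
    return_tensors="pt",
)
input_ids = encoding["input_ids"]
attention_mask = encoding["attention_mask"]   # pass this to the model as well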

answered Nov 03 '22 by user2182857