We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than that, say 10,000 tokens, how can BERT be used?
BERT cannot process long texts directly because its memory and time consumption grow quadratically with sequence length. The most natural workarounds, such as slicing the text with a sliding window or using simplified Transformer variants, either lose long-range attention or require customized CUDA kernels.
Further, BERT is not limited to text or sentence classification; it can also be applied to more advanced Natural Language Processing tasks such as next-sentence prediction, question answering, or Named-Entity Recognition.
The BERT block accepts any integer input size from 3 to 512 tokens. For the best performance, use the smallest size that does not cut away a significant part of your text (this can be difficult to estimate).
BERT also has the same limit of 512 tokens. For longer sequences, you normally just truncate to 512 tokens. The limit comes from the positional embeddings in the Transformer architecture, which require a fixed maximum length.
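For illustration, here is a minimal sketch of plain truncation with the Hugging Face transformers library; the checkpoint name, example text, and sequence-classification head are illustrative assumptions, not part of the answer above.

```python
# Minimal truncation sketch using Hugging Face transformers.
# "bert-base-uncased" and the classification head are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

long_text = "..."  # an article far longer than 512 tokens

# truncation=True keeps only the first 512 positions, including the
# [CLS] and [SEP] special tokens added by the tokenizer.
inputs = tokenizer(long_text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
```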
You have basically three options:

1. Truncate the text and only use the first 512 tokens.
2. Split the text into chunks (e.g., with a sliding window), run BERT on each chunk, and aggregate the per-chunk results.
3. Use a Transformer variant designed for longer inputs instead of vanilla BERT.

I would suggest trying option 1, and only if this is not good enough, consider the other options.
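As a rough sketch of option 2, the Hugging Face fast tokenizer can emit overlapping 512-token windows via return_overflowing_tokens; the stride value and the mean aggregation of per-window logits below are illustrative choices, not prescribed by the answer.

```python
# Sliding-window sketch for option 2: classify overlapping 512-token
# windows and average their logits. Stride and mean-pooling are
# illustrative choices. Requires a fast tokenizer (the default here).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

long_text = "..."  # an article far longer than 512 tokens

enc = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,                      # tokens of overlap between consecutive windows
    return_overflowing_tokens=True,  # return every window, not just the first
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
    ).logits                         # shape: (num_windows, num_labels)

doc_logits = logits.mean(dim=0)      # one aggregated prediction for the document
```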