I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").
Right now, I have my text preprocessed using a standard tokenizer that splits on spaces and some punctuation, followed by a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?
WordPiece is a subword tokenization technique that is commonly used and can be applied to many NLP models besides BERT. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.
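As a minimal sketch of that two-step behaviour, the Hugging Face transformers library (assumed installed here) exposes BERT's WordPiece tokenizer directly. The exact subword splits depend on the vocabulary learned by the particular pretrained model, so the comments are only illustrative.

```python
# Sketch of WordPiece tokenization via Hugging Face transformers
# (assumes: pip install transformers). Exact splits depend on the
# model's learned vocabulary, so the comments are illustrative.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent words often remain whole ...
print(tokenizer.tokenize("The players were playing outside"))

# ... while rarer or more complex words are split into pieces,
# with continuation pieces marked by the '##' prefix.
print(tokenizer.tokenize("unaffable hyperparameters"))
```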
Lemmatization, by contrast, is the process of taking individual tokens from a sentence and reducing them to their base form (the lemma). It does this by combining a vocabulary with morphological analysis to remove inflectional endings.
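For comparison, here is a minimal lemmatization sketch using NLTK's WordNetLemmatizer (NLTK and the WordNet data are assumed to be available); the part-of-speech hint is needed for it to reduce verb forms correctly.

```python
# Minimal lemmatization sketch with NLTK (assumes: pip install nltk).
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data is needed on first run

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos="v"))  # -> 'play'
print(lemmatizer.lemmatize("played", pos="v"))   # -> 'play' (tense information is lost)
```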
WordPiece was originally developed for Google's speech recognition system for Asian languages such as Korean and Japanese. These languages have a large inventory of characters, many homonyms, and few or no spaces between words, so the text had to be segmented explicitly.
Tokenization, more generally, is the process of splitting a string of text into a list of tokens. One can think of tokens as parts of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. In practice this happens at two levels: splitting text into sentences, and splitting sentences into words, as sketched below.
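A short sketch of those two levels with NLTK's tokenizers (the "punkt" sentence-tokenizer data is assumed to be downloadable; newer NLTK releases may additionally need "punkt_tab"):

```python
# Sentence-level and word-level tokenization with NLTK.
import nltk
nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"

from nltk.tokenize import sent_tokenize, word_tokenize

text = "BERT uses WordPiece. My pipeline uses a lemmatizer."
sentences = sent_tokenize(text)      # text -> sentences
words = word_tokenize(sentences[0])  # sentence -> word tokens
print(sentences)  # ['BERT uses WordPiece.', 'My pipeline uses a lemmatizer.']
print(words)      # ['BERT', 'uses', 'WordPiece', '.']
```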
WordPiece tokenization helps in multiple ways, and should generally serve you better than a lemmatizer. Beyond the out-of-vocabulary coverage you already mention, lemmatization discards inflectional information: "playing" (present tense) and "played" (past tense) are both reduced to "play", whereas with WordPiece tokenization the suffix survives as its own piece, so that distinction is not lost.

Using WordPiece tokenization instead of a tokenizer + lemmatizer is otherwise largely a design choice, and WordPiece tokenization should perform well. But you do have to take into account that WordPiece tokenization increases the number of tokens, which is not the case with lemmatization. A rough side-by-side comparison is sketched below.
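The sketch below (assuming both nltk and transformers are installed) contrasts what the lemmatizer and the WordPiece tokenizer do with inflected forms: the lemmatizer collapses tense, while WordPiece keeps the surface form but may emit more tokens per word, depending on the model's vocabulary.

```python
# Rough comparison of lemmatization vs. WordPiece tokenization
# (assumes: pip install nltk transformers).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer
from transformers import BertTokenizer

lemmatizer = WordNetLemmatizer()
wordpiece = BertTokenizer.from_pretrained("bert-base-uncased")

for word in ["playing", "played"]:
    print(
        word,
        "| lemma:", lemmatizer.lemmatize(word, pos="v"),  # both inflected forms collapse to 'play'
        "| wordpieces:", wordpiece.tokenize(word),        # splits (and token counts) depend on the vocabulary
    )
```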