
Wordpiece tokenization versus conventional lemmatization?

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do so. I noticed BERT uses WordPiece tokenization (for example, "playing" -> "play" + "##ing").

Right now, I have my text preprocessed using a standard tokenizer that splits on spaces / some punctuation, and then I have a lemmatizer ("playing" -> "play"). I'm wondering what the benefit of WordPiece tokenization is over a standard tokenization + lemmatization. I know WordPiece helps with out-of-vocabulary words, but is there anything else? That is, even if I don't end up using BERT, should I consider replacing my tokenizer + lemmatizer with WordPiece tokenization? In what situations would that be useful?

Keshinko asked Jul 16 '19


People also ask

What is WordPiece tokenization?

One such subword tokenization technique that is commonly used and can be applied to many other NLP models is called WordPiece. Given text, WordPiece first pre-tokenizes the text into words (by splitting on punctuation and whitespace) and then tokenizes each word into subword units, called wordpieces.
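
For instance, here is a minimal sketch of WordPiece in action using the Hugging Face transformers library (the library choice and checkpoint are assumptions; any WordPiece implementation behaves similarly), with the pretrained bert-base-uncased vocabulary:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Pre-tokenize on whitespace/punctuation, then split each word into
    # wordpieces; continuation pieces are prefixed with "##".
    print(tokenizer.tokenize("She was playing happily."))
    # e.g. ['she', 'was', 'playing', 'happily', '.']  (common words stay whole)
    print(tokenizer.tokenize("unplayable"))
    # e.g. ['un', '##play', '##able']  (exact split depends on the vocabulary)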

What is Lemmatization and tokenization?

Lemmatization is the process of taking individual tokens from a sentence and reducing them to their base form (the lemma). What makes this possible is having a vocabulary and performing morphological analysis to remove inflectional endings.
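
As a concrete illustration, here is a minimal sketch with NLTK's WordNetLemmatizer (one common implementation; the library choice is an assumption):

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lemma lookup needs the WordNet data
    lemmatizer = WordNetLemmatizer()

    # Morphological analysis removes inflectional endings, given a
    # part-of-speech hint ("v" = verb).
    for word in ["playing", "played", "plays"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # playing -> play, played -> play, plays -> play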

What is WordPiece Embeddings?

WordPiece embeddings were developed for Google's speech recognition system, for Asian languages like Korean and Japanese. These languages have a large inventory of characters, many homonyms, and few or no spaces between words. Few or no spaces meant segmentation of the text was necessary.

What is tokenization in NLP?

Tokenization is the process of splitting a string of text into a list of tokens. One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. Key points: tokenizing text into sentences, and tokenizing sentences into words.
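
Both levels can be seen in a short sketch, here using NLTK (an assumed library choice; the data-package name may vary by NLTK version):

    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download("punkt", quiet=True)  # data for the sentence splitter

    text = "Tokenization splits text into pieces. Words are the usual unit."
    sentences = sent_tokenize(text)                # text -> sentences
    words = [word_tokenize(s) for s in sentences]  # sentences -> words
    print(sentences)
    print(words)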


1 Answer

Word-piece tokenization helps in multiple ways and should generally be better than a lemmatizer, for two reasons:

  1. If words like 'playful', 'playing', and 'played' are all lemmatized to 'play', you lose information, such as the fact that 'playing' is present tense and 'played' is past tense. This loss does not happen with word-piece tokenization.
  2. Word-piece tokens cover every word, even words that do not occur in the vocabulary. An unknown word is split into word pieces, so you still get embeddings for those pieces, rather than dropping the word or replacing it with an 'unknown' token. A sketch after this list illustrates both points.
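
Here is a short sketch of both points, assuming the Hugging Face transformers library and the bert-base-uncased vocabulary (exact splits depend on the learned vocabulary):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # 1. Inflection survives: the pieces for 'playing' and 'played' differ,
    #    whereas a lemmatizer would collapse both to 'play'.
    print(tokenizer.tokenize("playing"), tokenizer.tokenize("played"))

    # 2. No out-of-vocabulary problem: an unseen word is split into known
    #    pieces instead of becoming an '[UNK]' token.
    print(tokenizer.tokenize("hyperplayable"))  # e.g. ['hyper', '##play', '##able']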

Using word-piece tokenization instead of a tokenizer + lemmatizer is largely a design choice, and word-piece tokenization should perform well. But you do have to take into account that word-piece tokenization increases the number of tokens per sentence, which is not the case with lemmatization, as the short sketch below shows.
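
The length difference is easy to check (same assumed tokenizer as above; the counts are illustrative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    sentence = "The retranscribed recordings were unplayable"
    pieces = tokenizer.tokenize(sentence)
    print(len(sentence.split()), "whitespace tokens")
    print(len(pieces), "wordpieces:", pieces)  # typically more pieces than words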

Ashwin Geet D'Sa answered Nov 26 '22