Can somebody help explain the basic concept behind the BPE model? Apart from the original paper, there are not many explanations of it yet.
What I know so far is that it enables open-vocabulary translation in NMT models by encoding rare and unknown words as sequences of subword units.
But I want to get a general idea of how it works without going through the paper.
Byte-Pair Encoding (BPE) is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur in that data. It was first described in the article "A New Algorithm for Data Compression", published in 1994.
Byte Pair Encoding (BPE) algorithm: here is how it works. Add an identifier (</w>) at the end of each word to mark the end of a word, then calculate the word frequencies in the text. Split each word into characters and calculate the character frequencies.
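A minimal Python sketch of that preprocessing step, on a hypothetical toy corpus (the words and variable names are only illustrative):

```python
# Sketch of the preprocessing described above: append the end-of-word marker,
# count word frequencies, and split each word into characters.
from collections import Counter

text = "low low lower newest newest"              # hypothetical toy corpus
word_freqs = Counter(text.split())                # word frequency in the text

# Each word becomes a sequence of characters plus the </w> end-of-word symbol.
corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

# Character (symbol) frequencies over the corpus.
char_freqs = Counter()
for symbols, freq in corpus.items():
    for s in symbols:
        char_freqs[s] += freq

print(corpus)       # {('l','o','w','</w>'): 2, ('l','o','w','e','r','</w>'): 1, ...}
print(char_freqs)
```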
Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and was then used by OpenAI for tokenization when pretraining the GPT model. It is used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa (HuggingFace).
Byte-Level Text Representation: in UTF-8 encoding, each character is encoded into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are 138,000 Unicode characters, a sentence can be represented as a sequence of UTF-8 bytes (248 out of 256 possible bytes).
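For illustration, here is what that byte-level view looks like in Python (the example string is arbitrary):

```python
# A sentence viewed as a sequence of UTF-8 bytes: every character maps to
# 1-4 byte values in the range 0-255, so the base alphabet has at most 256 symbols.
sentence = "héllo 世界"
byte_seq = list(sentence.encode("utf-8"))
print(byte_seq)   # [104, 195, 169, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140]
print(len(sentence), "characters ->", len(byte_seq), "bytes")   # 8 characters -> 13 bytes
```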
Byte Pair Encoding in NLP is an intermediate solution: it reduces the vocabulary size compared with word-based tokens, while covering as many frequently occurring sequences of characters as possible with a single token rather than with long sequences of character-based tokens.
Initially each character is represented by its own token, and the number of merge operations is a hyper-parameter for building the BPE vocabulary.
Let us consider the sentences:
'he had a cat' and 'the cat is sitting on the mat'
We can build the vocabulary of characters as follows:
['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't']
Now we can list the pairs and their occurrences (a small counting sketch follows this list):
he: 3 (he, the*2)
ha: 1 (had)
ad: 1 (had)
ca: 2 (cat*2)
at: 3 (cat*2, mat)
th: 2
is: 1
si: 1
it: 1
tt: 1
ti: 1
in: 1
ng: 1
on: 1
ma: 1
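The pair counts above can be reproduced with a few lines of Python (variable names are mine; pairs are counted within words only):

```python
# Count adjacent character pairs within each word of the two sentences.
from collections import Counter

sentences = ["he had a cat", "the cat is sitting on the mat"]
pair_freqs = Counter()
for sentence in sentences:
    for word in sentence.split():
        for a, b in zip(word, word[1:]):
            pair_freqs[a + b] += 1

print(pair_freqs.most_common())
# [('he', 3), ('at', 3), ('ca', 2), ('th', 2), ('ha', 1), ('ad', 1), ...]
```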
Now, as the pairs 'he' and 'at' occur most frequently, they can be combined (a merge operation) and added to the vocabulary.
Updated vocabulary: ['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't', 'he', 'at']
Now the longer parts of words that occur in the vocabulary can be referred to by a single token, i.e. 'he' or 'at' is represented by a single token instead of two character tokens.
Thus the sentences can be tokenized as follows:
['he', 'h', 'a', 'd', 'a', 'c', 'at']
['t', 'he', 'c', 'at', 'i', 's', 's', 'i', 't', 't', 'i', 'n', 'g', 'o', 'n', 't', 'he', 'm', 'at']
Further, this process can be repeated by identifying the most frequently occurring pairs:
ha: 1 (had)
ad: 1 (had)
c'at': 2 (cat*2)
t'he': 2 (the*2)
is: 1
si: 1
it: 1
tt: 1
ti: 1
in: 1
ng: 1
on: 1
m'at': 1
Since the pair c'at' is among the most frequent (tied with t'he'), it can be merged and added to the vocabulary as below:
['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't', 'he', 'at', 'cat']
Thus the new tokenization:
['he', 'h', 'a', 'd', 'a', 'cat']
['t', 'he', 'cat', 'i', 's', 's', 'i', 't', 't', 'i', 'n', 'g', 'o', 'n', 't', 'he', 'm', 'at']
Thus, with a higher number of merge operations the vocabulary size increases, but the number of tokens used to represent a given text decreases.
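Transcribing the tokenizations listed above into Python lists makes the trade-off concrete: every extra merge adds one entry to the vocabulary and shortens the token sequence.

```python
# Token counts for the toy corpus at 0, 2, and 3 merges (lists copied from above).
chars_only   = list("hehadacatthecatissittingonthemat")               # characters only
two_merges   = ['he', 'h', 'a', 'd', 'a', 'c', 'at',
                't', 'he', 'c', 'at', 'i', 's',
                's', 'i', 't', 't', 'i', 'n', 'g', 'o', 'n', 't', 'he', 'm', 'at']
three_merges = ['he', 'h', 'a', 'd', 'a', 'cat',
                't', 'he', 'cat', 'i', 's',
                's', 'i', 't', 't', 'i', 'n', 'g', 'o', 'n', 't', 'he', 'm', 'at']
print(len(chars_only), len(two_merges), len(three_merges))            # 32 26 24
```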
BPE is one of three algorithms that deal with the unknown-word problem (or with languages with rich morphology that require dealing with structure below the word level) in an automatic way; the others are unigram language modeling and WordPiece. The BPE tokenization scheme has two parts: a token learner and a token segmenter.
The token learner takes a raw training corpus (sometimes roughly pre-separated into words, for example by whitespace) and induces a vocabulary, a set of tokens. The token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary.
The algorithm has a parameter k, which means k merges, or equivalently k new symbols in the final vocabulary.
Let's train the BPE using this line: Pen Penapple Apple Pen
adapted from PPAP, and show how the unknown/rare "words" penapplepen and applepen in the test data can be automatically reduced into known subword units.
Firstly, after some preprocessing (case mapping, regex-based pre-tokenization, and adding the end-of-word symbol _) we obtain the following strings C (our corpus) and their frequencies (frequency: string):
2: p e n _
1: p e n a p p l e _
1: a p p l e _
The vocabulary V is [_, p, e, n, a, l]
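Here is a rough sketch of such a token learner in Python (the function and variable names are mine, not from the paper); run on this corpus with k = 9 it reproduces the merge sequence worked out below:

```python
from collections import Counter

def learn_bpe(corpus, k):
    """Token learner: corpus maps a tuple of symbols (one word, ending in the
    end-of-word symbol '_') to its frequency; returns the k merges in the
    order they were learned."""
    merges = []
    for _ in range(k):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:                      # nothing left to merge
            break
        a, b = max(pairs, key=pairs.get)   # most frequent pair (ties: first seen)
        merges.append((a, b))
        # Rewrite the corpus, replacing each occurrence of the pair with a + b.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

corpus = {("p", "e", "n", "_"): 2,
          ("p", "e", "n", "a", "p", "p", "l", "e", "_"): 1,
          ("a", "p", "p", "l", "e", "_"): 1}
print(learn_bpe(corpus, 9))
# [('p','e'), ('pe','n'), ('pen','_'), ('a','p'), ('ap','p'), ('app','l'),
#  ('appl','e'), ('apple','_'), ('pen','apple_')]
```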
Now let's run the first round of the token learner's main loop (see the sketch above):
p, e <- most frequent pair in {(p, e): 3, (e, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}
pe <- p + e
[_, p, e, n, a, l, pe] <- [_, p, e, n, a, l] + pe
C becomes this:
2: pe n _
1: pe n a p p l e _
1: a p p l e _
Let's run the second merge as follows:
pe, n <- most frequent pair in {(pe, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}
pen <- pe + n
[_, p, e, n, a, l, pe, pen] <- [_, p, e, n, a, l, pe] + pen
C becomes this:
2: pen _
1: pen a p p l e _
1: a p p l e _
And here are the remaining merges if we take k to be at least 9:
Merge Current V
(pen, _) [_, p, e, n, a, l, pe, pen, pen_]
(a, p) [_, p, e, n, a, l, pe, pen, pen_, ap]
(ap, p) [_, p, e, n, a, l, pe, pen, pen_, ap, app]
(app, l) [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl]
(appl, e) [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple]
(apple, _) [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_]
(pen, apple_) [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_, penapple_]
We see that after 9 iterations of merging, there are no adjacent pairs in C.
The token segmenter just runs, on the test data, the merges we have learned from the training data, greedily and in the order we learned them (thus the frequencies in the test data don't play a role, only the frequencies in the training data). A short code sketch follows the worked example below.
In this step we test the segmenter on the sentence: Applepen PenapplePen. As usual, we apply the same preprocessing as in the training step and obtain:
a p p l e p e n _
p e n a p p l e p e n _
and follow the merging order:
(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)
First, (p, e). We merge p and e in the test data and get:
a p p l e pe n _
pe n a p p l e pe n _
Second, we apply (pe, n) and get:
a p p l e pen _
pen a p p l e pen _
.....
After all 9 rounds of merging we get (if k < 9 we would just apply the first k merges in this step):
apple pen_
pen apple pen_
And the final tokenized test sentence is [apple, pen_, pen, apple, pen_]; the unknown word penapplepen (unseen in the training data) is also segmented into known subword units.
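A matching sketch of the token segmenter (names are again mine): it replays the learned merges, in the learned order, over each preprocessed test word and reproduces the result above.

```python
def segment(word, merges):
    """Token segmenter: apply the learned merges, in the learned order, to one
    preprocessed test word (characters plus the end-of-word symbol '_')."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == (a, b):
                tokens[i:i + 2] = [a + b]      # merge the adjacent pair in place
            else:
                i += 1
    return tokens

# The 9 merges learned from 'Pen Penapple Apple Pen', in order.
merges = [("p", "e"), ("pe", "n"), ("pen", "_"), ("a", "p"), ("ap", "p"),
          ("app", "l"), ("appl", "e"), ("apple", "_"), ("pen", "apple_")]
print(segment("applepen_", merges))       # ['apple', 'pen_']
print(segment("penapplepen_", merges))    # ['pen', 'apple', 'pen_']
```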