 

Explain BPE (Byte Pair Encoding) with examples?

Can somebody help explain the basic concept behind the BPE model? Apart from this paper, there are not many explanations of it yet.

What I know so far is that it enables open-vocabulary translation in NMT models by encoding rare and unknown words as sequences of subword units.

But I want to get a general idea of how it works without going through the paper.

asked May 29 '18 by lifang

People also ask

What is BPE in machine learning?

Byte-Pair Encoding (BPE) is a simple form of data compression algorithm in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur in that data. It was first described in the article "A New Algorithm for Data Compression", published in 1994.
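
For intuition, here is a tiny Python sketch of that compression idea (illustrative only; the helper compress_once and the fresh symbols Z, Y, X are made up for this example): repeatedly replace the most frequent adjacent pair with a symbol that does not occur in the data.

from collections import Counter

def compress_once(data, new_symbol):
    # Count adjacent pairs and replace the most frequent one with new_symbol.
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    if count < 2:
        return data, None                      # nothing worth replacing
    out, i = [], 0
    while i < len(data):
        if i < len(data) - 1 and data[i] == a and data[i + 1] == b:
            out.append(new_symbol)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out, (new_symbol, a + b)

data = list('aaabdaaabac')
rules = []
for sym in 'ZYX':                              # fresh symbols not present in the data
    data, rule = compress_once(data, sym)
    if rule is None:
        break
    rules.append(rule)

print(''.join(data))   # XdXac
print(rules)           # [('Z', 'aa'), ('Y', 'Za'), ('X', 'Yb')]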

How do you encoding byte pairs?

Here's how the Byte Pair Encoding (BPE) algorithm works: add an identifier (</w>) at the end of each word to mark the end of the word, then calculate the word frequencies in the text. Split each word into characters, then calculate the character frequencies.

What is byte level BPE tokenizer?

Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It's used by a lot of Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa. (HuggingFace)

What is byte level encoding?

Byte-Level Text Representation: In UTF-8 encoding, each character is encoded into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are 138,000 Unicode characters, a sentence can be represented as a sequence of UTF-8 bytes (248 out of 256 possible bytes).
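
A quick Python illustration of that byte-level view (the example string is mine, not from the quoted text):

sentence = "héllo 👋"
encoded = sentence.encode("utf-8")            # each character becomes 1 to 4 bytes
print(list(encoded))
# [104, 195, 169, 108, 108, 111, 32, 240, 159, 145, 139]
print(len(sentence), len(encoded))            # 7 characters -> 11 bytes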


2 Answers

Byte Pair Encoding in NLP is an intermediate solution: it reduces the vocabulary size compared with word-based tokens, and it covers frequently occurring sequences of characters with a single token instead of representing them as long sequences of character-based tokens.

Initially, each character is represented by its own token, and the number of merge operations is a hyper-parameter used when building the BPE vocabulary.

Let us consider the sentences:

'he had a cat' and 'the cat is sitting on the mat'

We can build the vocabulary of characters as follows:

['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't']

Now we can list the character pairs and their occurrences:

he: 3 (he, the*2)
ha: 1 (had)
ad: 1 (had)
ca: 2 (cat*2)
at: 3 (cat*2, mat)
th: 2
is: 1
si: 1
it: 1
tt: 1
ti: 1
in: 1
ng: 1
on: 1
ma: 1
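
Here is a minimal Python sketch of this counting step (the variable names are just for illustration):

from collections import Counter

sentences = ['he had a cat', 'the cat is sitting on the mat']

# Character vocabulary, as listed above.
char_vocab = sorted({ch for s in sentences for ch in s if ch != ' '})
print(char_vocab)
# ['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't']

# Count adjacent character pairs within each word (pairs never cross word boundaries).
pair_counts = Counter()
for sentence in sentences:
    for word in sentence.split():
        for left, right in zip(word, word[1:]):
            pair_counts[left + right] += 1

print(pair_counts.most_common())
# [('he', 3), ('at', 3), ('ca', 2), ('th', 2), ('ha', 1), ('ad', 1), ...]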

Now, as the pairs 'he' and 'at' occur most frequently, they can be combined (a merge operation) and added to the vocabulary.

Updated vocabulary: ['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't', 'he', 'at']

Now, longer parts of words that occur in the vocabulary can be referred to with a single token, i.e., 'he' or 'at' is represented by a single token instead of two character tokens.

Thus the sentences:

  1. 'he had a cat' and 2. 'the cat is sitting on the mat'

can be tokenized as follows:

  1. ['he', 'h','a','d', 'a', 'c', 'at']
  2. ['t','he', 'c','at', 'i', 's', 's', 'i','t','t','i','n','g', 'o','n', 't', 'he', 'm', 'at']
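
A small illustrative sketch of this re-tokenization step (the helpers apply_merge and tokenize are mine, not part of any standard API):

def apply_merge(symbols, pair):
    # Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def tokenize(sentence, merges):
    tokens = []
    for word in sentence.split():
        symbols = list(word)                 # start from single characters
        for pair in merges:                  # apply merges in the learned order
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens

merges = [('h', 'e'), ('a', 't')]
print(tokenize('he had a cat', merges))
# ['he', 'h', 'a', 'd', 'a', 'c', 'at']
print(tokenize('the cat is sitting on the mat', merges))
# ['t', 'he', 'c', 'at', 'i', 's', 's', 'i', 't', 't', 'i', 'n', 'g', 'o', 'n', 't', 'he', 'm', 'at']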

Further, this process can be repeated by identifying the most frequently occurring pairs:

ha: 1 (had)
ad: 1 (had)
c'at': 2 (cat*2)
t'he': 2 (the*2)
is: 1
si: 1
it: 1
tt: 1
ti: 1
in: 1
ng: 1
on: 1
m'at': 1 (mat)

Since the pair c'at' occurs most frequently (tied with t'he'), it can be merged to form the token 'cat' and added to the vocabulary:

['a', 'c', 'd', 'e', 'g', 'h', 'i', 'm', 'n', 'o', 's', 't', 'he', 'at', 'cat']

Thus the new tokenization:

  1. ['he', 'h','a','d', 'a', 'cat']
  2. ['t','he', 'cat', 'i', 's', 's', 'i','t','t','i','n','g', 'o','n', 't', 'he', 'm', 'at']

Thus, with a higher number of merge operations, the vocabulary size increases, but the number of tokens needed to represent a given text decreases.
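
A toy end-to-end sketch of this trade-off on the same two sentences (again, the helper names are only illustrative):

from collections import Counter

def pair_stats(words):
    # Count adjacent symbol pairs across all word splits.
    stats = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            stats[pair] += 1
    return stats

def merge_pair(symbols, pair):
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

text = 'he had a cat the cat is sitting on the mat'
words = [list(w) for w in text.split()]
vocab = sorted({ch for w in words for ch in w})

for num_merges in range(1, 4):
    stats = pair_stats(words)
    best = max(stats, key=stats.get)         # most frequent pair, ties broken by first occurrence
    words = [merge_pair(w, best) for w in words]
    vocab.append(best[0] + best[1])
    print(num_merges, best, len(vocab), sum(len(w) for w in words))

# 1 ('h', 'e') 13 29   <- vocabulary grows ...
# 2 ('a', 't') 14 26
# 3 ('c', 'at') 15 24  <- ... while the total token count shrinks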

answered Sep 21 '22 by Ashwin Geet D'Sa


BPE is one of three algorithms that deal with the unknown-word problem (or with languages with rich morphology that require handling structure below the word level) in an automatic way: byte-pair encoding, unigram language modeling, and WordPiece. The BPE tokenization scheme has two parts: a token learner and a token segmenter.

The token learner takes a raw training corpus (sometimes roughly pre-separated into words, for example by whitespace) and induces a vocabulary, a set of tokens. The token segmenter takes a raw test sentence and segments it into the tokens in the vocabulary.

The algorithm has a parameter k, the number of merges, i.e. the number of new symbols added to the final vocabulary.

Let's train BPE on this line: Pen Penapple Apple Pen (adapted from PPAP), and show how the unknown/rare "words" penapplepen and applepen in the test data can be automatically decomposed into known subword units.

Learning

[image: BPE token-learner pseudocode, from Speech and Language Processing]
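
In case that image does not load, here is a rough Python rendering of the token-learner loop (the function names get_pair_counts, merge_corpus and learn_bpe are mine, not from the book):

from collections import Counter

def get_pair_counts(corpus):
    # corpus: dict mapping a tuple of symbols (one word) to its frequency.
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_corpus(corpus, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    new_symbol = pair[0] + pair[1]
    new_corpus = {}
    for symbols, freq in corpus.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                merged.append(new_symbol)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        key = tuple(merged)
        new_corpus[key] = new_corpus.get(key, 0) + freq
    return new_corpus

def learn_bpe(corpus, k):
    # Return the vocabulary V and the ordered merge list after k merge steps.
    vocab = sorted({s for symbols in corpus for s in symbols})
    merges = []
    for _ in range(k):
        counts = get_pair_counts(corpus)
        if not counts:
            break                            # every word is already a single symbol
        best = max(counts, key=counts.get)   # most frequent adjacent pair
        corpus = merge_corpus(corpus, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges, corpus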

First, after some preprocessing (case mapping, regex-based pre-tokenization, and adding the end-of-word symbol _), we obtain the following strings C (our corpus) and their frequencies (format: frequency: string):

2: p e n _    
1: p e n a p p l e _  
1: a p p l e _  

The vocabulary V is [_, p, e, n, a, l]

Now let's run the first round of the for-loop in the above pseudocode:

p, e <- most frequent pair in {(p, e): 3, (e, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}    
pe <- p + e   
[_, p, e, n, a, l, pe] <- [_, p, e, n, a, l] + pe    
C becomes this: 
    2: pe n _    
    1: pe n a p p l e _  
    1: a p p l e _ 

Let's run the second merge as follows:

p, e <- most frequent pair in {(pe, n): 3, (n, _): 2, (a, p): 2, (p, p): 2, (p, l): 2, (l, e): 2, (e, _): 2, (n, a): 1}    
pen <- pe + n   
[_, p, e, n, a, l, pe, pen] <- [_, p, e, n, a, l, pe] + pen    
C becomes this: 
    2: pen _    
    1: pen a p p l e _  
    1: a p p l e _ 

And here are the remaining merges if we take k >= 9:

Merge          Current V 
(pen, _)       [_, p, e, n, a, l, pe, pen, pen_]  
(a, p)         [_, p, e, n, a, l, pe, pen, pen_, ap]
(ap, p)        [_, p, e, n, a, l, pe, pen, pen_, ap, app]
(app, l)       [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl]
(appl, e)      [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple]
(apple, _)     [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_]
(pen, apple_)  [_, p, e, n, a, l, pe, pen, pen_, ap, app, appl, apple, apple_, penapple_]

We see that after 9 iterations of merging, there are no adjacent pairs in C.
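
Feeding this corpus into the learn_bpe sketch above with k = 9 reproduces the same sequence of merges (ties are broken here by first occurrence):

corpus = {
    ('p', 'e', 'n', '_'): 2,
    ('p', 'e', 'n', 'a', 'p', 'p', 'l', 'e', '_'): 1,
    ('a', 'p', 'p', 'l', 'e', '_'): 1,
}
vocab, merges, corpus = learn_bpe(corpus, k=9)
print(merges)
# [('p', 'e'), ('pe', 'n'), ('pen', '_'), ('a', 'p'), ('ap', 'p'),
#  ('app', 'l'), ('appl', 'e'), ('apple', '_'), ('pen', 'apple_')]
print(vocab)
# ['_', 'a', 'e', 'l', 'n', 'p', 'pe', 'pen', 'pen_', 'ap', 'app', 'appl', 'apple', 'apple_', 'penapple_']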

Parsing

The token parser just runs on the test data the merges we have learned from the training data, greedily, in the order we learned them. (Thus the frequencies in the test data don’t play a role, just the frequencies in the training data).

In this step we test the parser using this sentence: Applepen PenapplePen. As usual, we do the preprocessing we did in the training step and obtain:

a p p l e p e n _   
p e n a p p l e p e n _ 

and follow the merging order:

(p, e), (pe, n), (pen, _), (a, p), (ap, p), (app, l), (appl, e), (apple, _), (pen, apple_)

First, (p, e). We merge p and e in the test data and get:

a p p l e pe n _    
pe n a p p l e pe n _  

Second, we apply the merge (pe, n) and get:

a p p l e pen _    
pen a p p l e pen _ 

...
After all 9 rounds of merging we get (if k <= 9, we just apply the first k merges in this step; if k is 2, refer to this answer):

apple pen_    
pen apple pen_

And the final tokenized test sentence is [apple, pen_, pen, apple, pen_]; the unknown word penapplepen (unseen in the training data) has been successfully split into known subword units.
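
For completeness, a matching Python sketch of the token parser/segmenter (the function name segment is mine): it simply replays the learned merges, in order, on each preprocessed test word.

# Merges learned above, in the order they were learned.
merges = [('p', 'e'), ('pe', 'n'), ('pen', '_'), ('a', 'p'), ('ap', 'p'),
          ('app', 'l'), ('appl', 'e'), ('apple', '_'), ('pen', 'apple_')]

def segment(word, merges):
    # Apply the learned merges, in order, to one preprocessed word (with trailing _).
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

test_words = ['applepen_', 'penapplepen_']   # 'Applepen PenapplePen' after preprocessing
print([token for w in test_words for token in segment(w, merges)])
# ['apple', 'pen_', 'pen', 'apple', 'pen_']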

References:

  1. Neural Machine Translation of Rare Words with Subword Units
  2. Speech and Language Processing
  3. PPAP (Pen Pineapple Apple Pen)

answered Sep 22 '22 by Lerner Zhang