In the Byte Pair Encoding algorithm, there's a replacement step where pairs of adjacent characters are merged into single bigram symbols.
I.e., given a list of str tuples like this:
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
And given a string tuple ('i', 's'), how do I process the list such that it iterates through all the tuples and replaces every adjacent ('i', 's') with the merged string 'is'? I.e., the output will look something like this:
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
I've tried this:
>>> cin
[('t', 'h', 'i', 's', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('i', 's', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
>>> [tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
[('t', 'h', 'is', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000'), ('i', 'n', '\ue000'), ('t', 'x', 't', 'f', 'i', 'l', 'e', '\ue000'), ('t', 'h', 'e', '\ue000'), ('s', 'e', 'n', 't', 'e', 'n', 'c', 'e', '\ue000'), ('b', 'a', 'r', '\ue000'), ('a', 'n', 'd', '\ue000'), ('is', '\ue000'), ('f', 'o', 'o', '\ue000'), ('f', 'i', 'r', 's', 't', '\ue000'), ('a', '\ue000'), ('.', '\ue000')]
But is there a more efficient way than looping through each word, converting it to a string to do the replace, splitting it again, and then casting it back into a tuple?
Would regex replacement be faster? Is there a way to work with the list of tuples without dealing with strings?
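For reference, here is a rough sketch of one way I could imagine doing the merge directly on the tuples, with no join/split round-trip (the helper name merge_pair is made up for illustration):

def merge_pair(word, pair):
    """Merge every adjacent occurrence of `pair` in `word` into one symbol."""
    merged = pair[0] + pair[1]
    out = []
    i = 0
    while i < len(word):
        # if the pair starts at position i, emit the merged symbol and skip both items
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

print([merge_pair(w, ('i', 's')) for w in cin])
# [('t', 'h', 'is', '\ue000'), ..., ('is', '\ue000'), ...]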
I've tried this, and it seems like replacing the string with str.replace is not the problem. It's really counting the bigrams and extracting them:
import io
from collections import Counter
import time
infile = 'big.txt' # comes from norvig.com/big.txt
n = 2
with io.open(infile, 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

for j in range(1, 6400):
    unused_char = unichr(ord(u'\uE001') + j)

    start = time.time()
    char_bigrams = zip(*[text[i:] for i in range(n)])
    bigram_time = time.time() - start

    start = time.time()
    most_freq_bigram = Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
    max_time = time.time() - start

    start = time.time()
    text = text.replace(''.join(most_freq_bigram), unused_char)
    replace_time = time.time() - start

    print j, ''.join(most_freq_bigram), most_freq_bigram, bigram_time, max_time, replace_time

print text
This is tested on norvig.com/big.txt
[out]:
1 th (u't', u'h') 0.896255016327 3.28389787674 0.0253069400787
2 e (u'\ue002', u'e') 1.47053217888 3.16544914246 0.0280749797821
3 in (u'i', u'n') 1.13404297829 3.10529899597 0.0245559215546
4 an (u'a', u'n') 1.20013689995 3.63801002502 0.0242891311646
5 er (u'e', u'r') 1.41387891769 3.13376092911 0.0237591266632
6 on (u'o', u'n') 1.22826981544 3.06997895241 0.0227301120758
7 re (u'r', u'e') 1.21916294098 2.97599196434 0.0238041877747
8 at (u'a', u't') 1.14608097076 2.97988891602 0.0226521492004
9 en (u'e', u'n') 1.20747494698 2.88649988174 0.019054889679
10 ed (u'e', u'd') 1.16296696663 2.8995718956 0.0198271274567
11 is (u'i', u's') 1.17692494392 3.02292394638 0.0228500366211
12 d (u'\ue005', u'd') 1.13779211044 2.85169506073 0.0229239463806
I've already experimented with scikit-learn's CountVectorizer and it didn't seem to be as fast as using zip; see Fast/Optimize N-gram implementations in python.
Also, without the filter operation in the Counter step, it took even longer. The Counter operation is taking 3 seconds per iteration =(
How else can this operation be optimized?
Counter(filter(lambda x: u"\uE000" not in x and '\n' not in x, char_bigrams)).most_common(1)[0][0]
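One small thing I could try (just a sketch, I haven't benchmarked it on big.txt) is dropping the filter/lambda and giving Counter a generator expression instead, so there's no extra Python function call per bigram:

most_freq_bigram = Counter(
    bg for bg in char_bigrams
    if u"\ue000" not in bg and '\n' not in bg
).most_common(1)[0][0]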
[tuple(' '.join(i).replace(' '.join(qtuple), ''.join(qtuple)).split()) for i in cin]
I'll expand it so it's easier to see what's happening:
result = []
qtuple = ("i", "s")

for i in cin:
    f = " ".join(qtuple)
    r = "".join(qtuple)

    word = ' '.join(i)
    word = word.replace(f, r)
    word = word.split()
    word = tuple(word)

    result.append(word)

print(result)
Look for things you can move outside of the loop. We can precompute the replacements instead of computing them for each word:
find = " ".join(qtuple)
replacement = "".join(qtuple)
result = []

# this will join and split each word once
for i in cin:
    word = " ".join(i)
    # if you had multiple replacements to do, they should be in an inner loop here
    word = word.replace(find, replacement)
    result.append(tuple(word.split(" ")))

print(result)
Perhaps someone else can speak to the relative efficiency of str.replace versus re.sub. Personally I tend to avoid regular expressions if a simple replace will do, just for readability.
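If you want to measure that yourself, a rough sketch with timeit (numbers will vary by machine and Python version) could look like this; re.escape is there because a merged pair might contain regex metacharacters:

import re
import timeit

word = ' '.join(('t', 'h', 'i', 's', '\ue000'))
find, repl = 'i s', 'is'
pattern = re.compile(re.escape(find))

# time plain str.replace against a precompiled regex substitution
print(timeit.timeit(lambda: word.replace(find, repl), number=100000))
print(timeit.timeit(lambda: pattern.sub(repl, word), number=100000))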
Further efficiency gains can be realized by changing the data structure for the input. If the replacement symbols are single characters then we can use a string instead of a list of tuples and avoid any joins inside the loop.
replacements = [("\ue000", "X"), ("is", "Z")]
s = "".join(["".join(t) for t in cin])

for f, r in replacements:
    s = s.replace(f, r)

print(s)
# output: thZXcorpusXinXtxtfileXtheXsentenceXbarXandXZXfooXfirstXaX.X
I think the question needs some requirements added to it to explain why the chosen data structure is advantageous. From an efficiency point of view, and in the context of the byte pair encoding algorithm, a string makes a lot more sense to me.
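If you still need the tuple view downstream, one possible way (a sketch, assuming every symbol is a single character and '\ue000' marks the end of each word) is to split the flat string back into per-word tuples:

s = 'th\ue001\ue000corpus\ue000'  # '\ue001' here stands for the merged pair 'is'
words = [tuple(w) + ('\ue000',) for w in s.split('\ue000') if w]
print(words)
# [('t', 'h', '\ue001', '\ue000'), ('c', 'o', 'r', 'p', 'u', 's', '\ue000')]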