I need to split a string into words, but also get the starting and ending offset of the words. So, for example, if the input string is:
input_string = "ONE  ONE ONE  \t  TWO TWO ONE TWO TWO THREE"
I want to get:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
I've got some working code that does this using input_string.split and calls to .index, but it's slow. I tried to code it by manually iterating through the string, but that was slower still. Does anyone have a fast algorithm for this?
Here are my two versions:
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char
    return offsets
Measured with timeit, "using_split" is the fastest, followed by "manual_iteration"; the slowest so far is re.finditer, as suggested below.
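For anyone who wants to reproduce the comparison, here is a minimal timeit sketch. The helper names `benchmark` and `using_finditer`, the `LINE` constant, and the repeat counts are my own assumptions, not the asker's actual harness, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# Sample line; the whitespace runs are chosen to match the offsets in the question.
LINE = "ONE  ONE ONE  \t  TWO TWO ONE TWO TWO THREE"

def using_split(line):
    # Same approach as in the question: split, then locate each word with str.index.
    offsets = []
    running_offset = 0
    for word in line.split():
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

def using_finditer(line):
    # Regex variant: every run of non-whitespace characters is one word.
    return [(m.group(0), m.start(), m.end() - 1)
            for m in re.finditer(r"\S+", line)]

def benchmark(funcs, line=LINE, number=20000, repeat=3):
    # Hypothetical helper: best-of-`repeat` wall time in seconds per function.
    results = {}
    for func in funcs:
        timer = timeit.Timer(lambda: func(line))
        results[func.__name__] = min(timer.repeat(repeat=repeat, number=number))
    return results

if __name__ == "__main__":
    timings = benchmark([using_split, using_finditer])
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name:16s} {seconds:.4f} s")
```

Taking the minimum over several repeats tends to be less noisy than averaging, since slower runs are usually caused by other activity on the machine.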
The following will do it:
import re
s = 'ONE  ONE ONE  \t  TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
print(ret)
This produces:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
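Note that the end offsets above are inclusive, which is why the code subtracts 1 from m.end(). As a small sketch of the same idea wrapped in a reusable generator (the names `_WORD_RE` and `word_spans` are hypothetical, and precompiling the pattern is just a convenience here, not a measured speed-up):

```python
import re

# Hypothetical module-level pattern: one or more non-whitespace characters.
_WORD_RE = re.compile(r"\S+")

def word_spans(line):
    """Yield (word, start, inclusive_end) for each run of non-whitespace."""
    for m in _WORD_RE.finditer(line):
        yield m.group(0), m.start(), m.end() - 1

s = "ONE  ONE \t TWO"
spans = list(word_spans(s))
print(spans)  # [('ONE', 0, 2), ('ONE', 5, 7), ('TWO', 11, 13)]

# Because the end offset is inclusive, slicing needs end + 1 to get the word back.
for word, start, end in spans:
    assert s[start:end + 1] == word
```

If half-open offsets are acceptable, m.span() returns (start, end) directly and the slice becomes s[start:end].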
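The following runs faster - it saves about 30%. All I did was bind the functions and methods used inside the loop (len, line.index and offsets.append) to local names in advance, so they are not looked up again on every iteration: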
def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets
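To see the effect of the local-name trick in isolation, here is a rough micro-benchmark sketch. The function names and loop sizes are made up for illustration; the gain depends heavily on the interpreter and may be much smaller on recent CPython versions.

```python
import timeit

def append_via_attribute(n):
    out = []
    for i in range(n):
        out.append(i)      # looks up .append on every iteration
    return out

def append_via_local(n):
    out = []
    append = out.append    # bound method is looked up once, before the loop
    for i in range(n):
        append(i)
    return out

if __name__ == "__main__":
    for func in (append_via_attribute, append_via_local):
        # Best of 3 runs, each making 200 calls with n=10000.
        best = min(timeit.repeat(lambda: func(10000), number=200, repeat=3))
        print(f"{func.__name__:22s} {best:.4f} s")
```

Both functions build the same list, so any timing difference comes from the name lookups alone.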