
How to avoid substrings

I currently process sections of a string like this:

for (i, j) in huge_list_of_indices:
    process(huge_text_block[i:j])

I want to avoid the overhead of generating these temporary substrings. Any ideas? Perhaps a wrapper that somehow uses index offsets? This is currently my bottleneck.

Note that process() belongs to another Python module and expects a string as input.

Edit:

A few people doubt there is a problem. Here are some sample results:

import time
import string
text = string.letters * 1000

def timeit(fn):
    t1 = time.time()
    for i in range(len(text)):
        fn(i)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1) * 1000)

def test_1(i):
    return text[i:]

def test_2(i):
    return text[:]

def test_3(i):
    return text

timeit(test_1)
timeit(test_2)
timeit(test_3)

Output:

test_1 took 972.046 ms
test_2 took 47.620 ms
test_3 took 43.457 ms
hoju asked Dec 20 '10 00:12


2 Answers

I think what you are looking for are buffers.

Buffers "slice" an object that supports the buffer interface without copying its content; they essentially open a "window" onto the sliced object's content. A more technical explanation is available here. An excerpt:

Python objects implemented in C can export a group of functions called the “buffer interface.” These functions can be used by an object to expose its data in a raw, byte-oriented format. Clients of the object can use the buffer interface to access the object data directly, without needing to copy it first.

In your case the code should look more or less like this:

>>> s = 'Hugely_long_string_not_to_be_copied'
>>> ij = [(0, 3), (6, 9), (12, 18)]
>>> for i, j in ij:
...     print buffer(s, i, j-i)  # Should become process(...)
Hug
_lo
string
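
Note that buffer() exists only in Python 2. On Python 3 the closest built-in is memoryview, which works on bytes-like objects rather than on str. Below is a minimal sketch of the same idea, assuming the text can be held as bytes and that process() accepts a bytes-like object (neither is stated in the question). The helper name iter_windows is made up for illustration.

```python
def iter_windows(data, index_pairs):
    """Yield zero-copy windows data[i:j] for each (i, j) pair."""
    view = memoryview(data)
    for i, j in index_pairs:
        # Slicing a memoryview creates a new view object, not a copy of the bytes.
        yield view[i:j]


s = b'Hugely_long_string_not_to_be_copied'
ij = [(0, 3), (6, 9), (12, 18)]

# bytes(w) copies the window; it is used here only to display the result.
# In real code the view itself would be passed to process(...).
windows = [bytes(w) for w in iter_windows(s, ij)]
print(windows)
```

If process() insists on a real str, the window has to be decoded first, which copies the data again and removes the benefit.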

HTH!

mac answered Sep 28 '22 02:09


A wrapper that uses index offsets to a mmap object could work, yes.
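
Such a wrapper might look roughly like the sketch below. It is written for Python 3 and assumes the text lives in a file and that process() can work with a bytes-like window; the class name MappedText and its methods are hypothetical, not part of any library.

```python
import mmap
import os
import tempfile


class MappedText:
    """Hypothetical wrapper handing out zero-copy windows into a mapped file."""

    def __init__(self, path):
        self._file = open(path, 'rb')
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._view = memoryview(self._map)

    def window(self, i, j):
        # A memoryview slice references the mapping; self._map[i:j] would copy.
        return self._view[i:j]

    def close(self):
        # All views must be released before the mmap itself can be closed.
        self._view.release()
        self._map.close()
        self._file.close()


# Build a small sample file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'Hugely_long_string_not_to_be_copied')
    path = tmp.name

mapped = MappedText(path)
pieces = []
for i, j in [(0, 3), (6, 9), (12, 18)]:
    with mapped.window(i, j) as w:   # the window is released on exit
        pieces.append(bytes(w))      # stand-in for process(w)
mapped.close()
os.remove(path)
print(pieces)
```

Each window is released before close() is called, because closing an mmap while views on it are still alive raises BufferError.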

But before you do that, are you sure that generating these substrings is a problem? Don't optimize before you have found out where the time and memory actually go. I wouldn't expect this to be a significant problem.
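
One rough way to check is to time the slicing on its own against slicing plus the real work. The sketch below uses Python 3 and timeit; the process() function, the sample text and the index pairs are all made-up stand-ins, so only the method carries over, not the numbers.

```python
import timeit


def process(chunk):
    # Hypothetical stand-in for the real process() from the question.
    return chunk.count('a')


text = 'abcdefghij' * 10000
pairs = [(k, k + 50) for k in range(0, len(text) - 50, 50)]


def slice_only():
    for i, j in pairs:
        text[i:j]            # create the substring and discard it


def slice_and_process():
    for i, j in pairs:
        process(text[i:j])   # create the substring and do the work


t_slice = timeit.timeit(slice_only, number=20)
t_total = timeit.timeit(slice_and_process, number=20)
print('slicing alone:     %.4f s' % t_slice)
print('slicing + process: %.4f s' % t_total)
```

If the first figure is a small fraction of the second, avoiding the substrings will not buy much.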

Lennart Regebro answered Sep 28 '22 00:09