
How do I split a string and rejoin it without creating an intermediate list in Python?

Say I have something like the following:

dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )

(i.e. strip any lines starting with # from the multi-line string src)

src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?

Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?

asked Aug 23 '10 by Tom

4 Answers

Here's a way to do a general type of split using itertools:

>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'

groupby treats each character of src separately, so the performance probably isn't stellar, but it does avoid creating any huge intermediate data structures.
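To see why groupby operates character by character here, a small illustration (the sample string is mine, not from the answer):

```python
import itertools as it

# The key function "\n".__ne__ asks "is this character not a newline?",
# so runs of ordinary characters form one group and each newline forms
# its own single-character group.
src = "ab\ncd"
groups = [(key, ''.join(chars)) for key, chars in it.groupby(src, "\n".__ne__)]
print(groups)  # [(True, 'ab'), (False, '\n'), (True, 'cd')]
```

The `if i` in the original expression simply discards the newline groups (the ones whose key is False).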

It's probably better to spend a few lines and write a dedicated generator:

>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i=j=0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i<len(s):
...                 yield s[i:]
...             return  # not "raise StopIteration" -- an error in generators since PEP 479
...         yield s[i:j]
...         i = j+1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[:1]!='#')
'hello\nworld'

The re module has a function called finditer that could be used for this purpose too:

>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer(r"(.*?)(\n|$)", src))
>>> '\n'.join(s for s in line_gen if s and not s.startswith("#"))
'hello\nworld'

Comparing the performance of these approaches is left as an exercise for the OP to try on the real data.

answered by John La Rooy

from io import StringIO

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")

Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
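For instance, a minimal sketch of those file-like operations (using io.StringIO from the standard library; the sample string is mine):

```python
from io import StringIO

buf = StringIO("hello\n#foo\nworld\n")
first = buf.readline()   # read a single line, just like a file object
buf.seek(0)              # rewind to the beginning
# Iterating yields one line at a time (each still ending in "\n")
kept = "".join(line for line in buf if line[:1] != "#")
print(repr(first))  # 'hello\n'
print(repr(kept))   # 'hello\nworld\n'
```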

answered by Matthew Flaschen

In your existing code you can change the list to a generator expression:

dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")

This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.

A completely different approach that avoids the temporary construction of both lists is to use a regular expression:

import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)

This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:

init = r'''
import re, io
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''

method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = io.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit

for method in [method1, method2, method3, method4]:
    print(timeit.timeit(method, init, number=100))

Results:

 9.38s   # Split then join with temporary list
 9.92s   # Split then join with generator
 8.60s   # Regular expression
64.56s   # StringIO

As you can see, the regular expression is the fastest method.

From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.

If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
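A minimal sketch of that streaming approach (the filenames and sample content here are placeholders, not from the question):

```python
# Create a small sample input so the sketch runs end to end
# ("src.txt" and "dest.txt" are placeholder filenames).
with open("src.txt", "w") as f:
    f.write("hello\n#foo\nworld\n")

# Read one line, write one line: only a single line is held in memory
# at a time, regardless of how large the input file is.
with open("src.txt") as fin, open("dest.txt", "w") as fout:
    for line in fin:
        if not line.startswith("#"):
            fout.write(line)
```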

answered by Mark Byers

If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:

import re

output = ""
for m in re.finditer(r"^.*\n", src, re.M):
    line = m.group(0).strip()
    if line.startswith("#"):
        continue
    output += line + "\n"

Here you can replace the regular expression with something more sophisticated.
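For example, an illustrative variation (my own, not from the answer) that also drops blank lines:

```python
import re

src = "hello\n\n#foo\n   \nworld\n"   # sample input, not from the question
lines = []
# "\n?" makes the pattern tolerant of a missing trailing newline
for m in re.finditer(r"^.*\n?", src, re.M):
    line = m.group(0).strip()
    if not line or line.startswith("#"):   # skip blank lines as well as comments
        continue
    lines.append(line)
output = "\n".join(lines)
print(repr(output))  # 'hello\nworld'
```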

answered by loevborg