Say I have something like the following:
dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )
(i.e. strip any lines starting with # from the multi-line string src.)
src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?
Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?
Here's a way to do a general type of split using itertools:
>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'
groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any huge intermediate data structures.
Probably better to spend a few lines and make a generator:
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i = j = 0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i < len(s):
...                 yield s[i:]
...             return  # in Python 3.7+, raising StopIteration here would be a RuntimeError (PEP 479)
...         yield s[i:j]
...         i = j + 1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0]!='#')
'hello\nworld'
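One subtlety worth checking against your real data: for a string that ends in the separator, this isplit drops the trailing empty field that str.split produces (harmless here, since empty lines don't start with #). A quick self-contained comparison:

```python
def isplit(s, t):
    # same generator as above: yield successive fields of s split on t
    i = j = 0
    while True:
        try:
            j = s.index(t, i)
        except ValueError:
            if i < len(s):
                yield s[i:]  # final field, only if non-empty
            return
        yield s[i:j]
        i = j + 1

src = "hello\n#foo\nworld\n"
print(list(isplit(src, "\n")))  # ['hello', '#foo', 'world']
print(src.split("\n"))          # ['hello', '#foo', 'world', '']
```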
re has a method called finditer that could be used for this purpose too (note that in Python 3.7 and later, finditer also yields a final zero-width match for this pattern, which adds a trailing newline to the joined result):
>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld'
Comparing the performance is an exercise for the OP to try on the real data.
from io import StringIO  # Python 2: from StringIO import StringIO
buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")
Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files: you can seek, read, write, iterate (as shown), etc.
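To make "works mostly the same as files" concrete, here is a small sketch (the sample string is mine; io.StringIO is the Python 3 spelling, Python 2 used the StringIO module):

```python
from io import StringIO

buffer = StringIO("hello\n#foo\nworld\n")

print(buffer.readline())  # read a single line, just like a file
buffer.seek(0)            # rewind to the start
print(buffer.read(5))     # read the first 5 characters

buffer.seek(0)
print([line for line in buffer])  # and iterate line by line
```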
In your existing code you can change the list to a generator expression:
dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")
This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.
A completely different approach that avoids the temporary construction of both lists is to use a regular expression:
import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
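On a small sample (the sample string is mine, not from the question), the substitution behaves like this:

```python
import re

src = "hello\n#foo\n#bar\n#baz\nworld\n"
# with re.M, '^' matches at the start of every line, so the pattern
# matches a '#', the rest of that line, and its trailing newline
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
print(repr(dest))  # 'hello\nworld\n'
```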
This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:
init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''
method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit
for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number=100)
Results:
9.38s   # Split then join with temporary list
9.92s   # Split then join with generator
8.60s   # Regular expression
64.56s  # StringIO
As you can see the regular expression is the fastest method.
From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:
import re

output = ""
for i in re.finditer("^.*\n", input, re.M):
    i = i.group(0).strip()
    if i.startswith("#"):
        continue
    output += i + "\n"
Here you can replace the regular expression by something more sophisticated.
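One caveat with the loop above: output += copies the growing string on every iteration, which is quadratic for large inputs. Collecting the lines in a list and joining once at the end (a standard idiom, not part of the original answer) avoids that:

```python
import re

src = "hello\n#foo\n#bar\n#baz\nworld\n"  # stand-in for the real input
lines = []
for m in re.finditer("^.*\n", src, re.M):
    line = m.group(0).strip()
    if not line.startswith("#"):
        lines.append(line)
output = "".join(line + "\n" for line in lines)  # single final join
print(repr(output))  # 'hello\nworld\n'
```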