Say I have something like the following:
dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )
(i.e. strip any lines starting with # from the multi-line string src.)
src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?
Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?
Here's a way to do a general type of split using itertools:
>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'
groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any huge intermediate data structures.
Probably better to spend a few lines and make a generator:
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i = j = 0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i < len(s):
...                 yield s[i:]
...             return  # in Python 3.7+, raising StopIteration here would be a RuntimeError (PEP 479)
...         yield s[i:j]
...         i = j + 1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[0]!='#')
'hello\nworld'
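One subtlety worth checking against your real data: for a string that ends in the separator, this isplit drops the trailing empty field that str.split produces (harmless here, since empty lines don't start with #). A quick self-contained comparison:

```python
def isplit(s, t):
    # same generator as above: yield successive fields of s split on t
    i = j = 0
    while True:
        try:
            j = s.index(t, i)
        except ValueError:
            if i < len(s):
                yield s[i:]  # final field, only if non-empty
            return
        yield s[i:j]
        i = j + 1

src = "hello\n#foo\nworld\n"
print(list(isplit(src, "\n")))  # ['hello', '#foo', 'world']
print(src.split("\n"))          # ['hello', '#foo', 'world', '']
```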
re has a method called finditer that could be used for this purpose too (note that in Python 3.7 and later, finditer also yields a final zero-width match for this pattern, which adds a trailing newline to the joined result):
>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld'
Comparing the performance is an exercise for the OP to try on the real data.
from io import StringIO  # Python 2: from StringIO import StringIO
buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")
Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files: you can seek, read, write, iterate (as shown), etc.
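To make "works mostly the same as files" concrete, here is a small sketch (the sample string is mine; io.StringIO is the Python 3 spelling, Python 2 used the StringIO module):

```python
from io import StringIO

buffer = StringIO("hello\n#foo\nworld\n")

print(buffer.readline())  # read a single line, just like a file
buffer.seek(0)            # rewind to the start
print(buffer.read(5))     # read the first 5 characters

buffer.seek(0)
print([line for line in buffer])  # and iterate line by line
```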
In your existing code you can change the list to a generator expression:
dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")
This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.
A completely different approach that avoids the temporary construction of both lists is to use a regular expression:
import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
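On a small sample (the sample string is mine, not from the question), the substitution behaves like this:

```python
import re

src = "hello\n#foo\n#bar\n#baz\nworld\n"
# with re.M, '^' matches at the start of every line, so the pattern
# matches a '#', the rest of that line, and its trailing newline
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)
print(repr(dest))  # 'hello\nworld\n'
```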
This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:
init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''
method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit
for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number=100)
Results:
9.38s   # Split then join with temporary list
9.92s   # Split then join with generator
8.60s   # Regular expression
64.56s  # StringIO
As you can see the regular expression is the fastest method.
From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:
import re

output = ""
for i in re.finditer("^.*\n", input, re.M):
    i = i.group(0).strip()
    if i.startswith("#"):
        continue
    output += i + "\n"
Here you can replace the regular expression by something more sophisticated.
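One caveat with the loop above: output += copies the growing string on every iteration, which is quadratic for large inputs. Collecting the lines in a list and joining once at the end (a standard idiom, not part of the original answer) avoids that:

```python
import re

src = "hello\n#foo\n#bar\n#baz\nworld\n"  # stand-in for the real input
lines = []
for m in re.finditer("^.*\n", src, re.M):
    line = m.group(0).strip()
    if not line.startswith("#"):
        lines.append(line)
output = "".join(line + "\n" for line in lines)  # single final join
print(repr(output))  # 'hello\nworld\n'
```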