Fastest way to remove first and last lines from a Python string

I have a Python script that, for various reasons, holds a fairly large string in a variable, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Because the string is so large, speed is the main concern. The program should return a slightly smaller string, without the first and last lines.

'\n'.join(string_variable.split('\n')[1:-1]) is the easiest way to do this, but it's extremely slow because split() builds a list holding a copy of every line, and join() then copies the data again.

Example string:

*** START OF DATA ***
data
data
data
*** END OF DATA ***

Extra credit: Have this program not choke if there is no data in between; this is optional, since for my case there shouldn't be a string with no data in between.

james chang asked Jan 25 '15 07:01


2 Answers

First split once at '\n' and keep the part after the first line. If that remainder contains no '\n', there is nothing between the first and last lines, so return an empty string; otherwise call str.rsplit once at '\n' and take the item at index 0:

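>>> def solve(s):
...     # Drop the first line; maxsplit=1 stops after the first newline
...     s = s.split('\n', 1)[-1]
...     # No newline left: nothing lies between the first and last lines
...     if '\n' not in s:
...         return ''
...     # Drop the last line by splitting once from the right
...     return s.rsplit('\n', 1)[0]
...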
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from each end and slice the string:

>>> def solve_fast(s):
...     ind1 = s.find('\n')   # end of the first line
...     ind2 = s.rfind('\n')  # start of the last line
...     return s[ind1+1:ind2]
...
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
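
One caveat with the slicing version: if the string contains no newline at all, both find and rfind return -1, so the slice becomes s[0:-1] and returns the string minus its last character rather than an empty string. A minimal sketch of a guarded variant follows; the name strip_first_last is hypothetical and not part of the answer above.

```python
def strip_first_last(s):
    """Return s without its first and last lines, or '' if nothing lies between them."""
    first = s.find('\n')
    last = s.rfind('\n')
    # first == -1: no newline at all; first == last: exactly one newline.
    # In both cases there is no data between a first and a last line.
    if first == -1 or first == last:
        return ''
    return s[first + 1:last]
```

This still does only two searches and one slice, so it should cost about the same as solve_fast on large inputs.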
Ashwini Chaudhary answered Oct 20 '22 00:10


Consider a string s that is something like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

This is therefore about the shortest code that removes the first and last lines of a string. As far as I know, the .find and .rfind methods do nothing but search for the given substring and return an index, so the slice is the only copy that gets made. Try out the speed!
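
To try out the speed outside IPython, here is a rough sketch using the standard timeit module. The function names split_join and find_slice are made up for this comparison, and the test string (100,000 lines of 100 characters, roughly 10 MB) is an assumption chosen to match the size mentioned in the question. Timings will vary by machine, so no numbers are shown.

```python
import timeit

def split_join(s):
    # Baseline from the question: split every line, then join the middle ones
    return '\n'.join(s.split('\n')[1:-1])

def find_slice(s):
    # Two searches plus one slice, so the result is copied only once
    return s[s.find('\n') + 1:s.rfind('\n')]

# Roughly 10 MB of test data: 100,000 lines of 100 characters each
s = '\n'.join(['a' * 100] * 10**5)

# Both approaches must agree before their timings are worth comparing
assert split_join(s) == find_slice(s)

runs = 20
for fn in (split_join, find_slice):
    seconds = timeit.timeit(lambda: fn(s), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1000:.2f} ms per call")
```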

Benjamin Spiegl answered Oct 19 '22 23:10