For very large strings (spanning multiple lines), is it faster to use Python's built-in string search or to split the large string (perhaps on \n) and iteratively search the smaller strings?
E.g., for very large strings:
for l in get_mother_of_all_strings().split('\n'):
    if 'target' in l:
        return True
return False
or
return 'target' in get_mother_of_all_strings()
Summary: find and in depend on the string length and the location of the pattern in the string, while a regex search is much less sensitive to string length and is faster for very long strings when the pattern is near the end.
Is regex faster than find? Note that the main thing slowing the regexp down here is having to compile the pattern every time. If the pattern is precompiled and reused, matching is only about 25% slower than using find.
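A rough sketch of that comparison (the sample text, pattern, and iteration count here are only illustrative, not the original poster's benchmark):

import re
import timeit

# a long string with the pattern at the very end
text = 'x' * 1000000 + 'target'

# compile once, reuse many times
pattern = re.compile('target')

# str.find scans the string directly
print(timeit.timeit(lambda: text.find('target'), number=100))

# a precompiled regex avoids re-parsing the pattern on every call
print(timeit.timeit(lambda: pattern.search(text), number=100))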
The easiest way to check whether a Python string contains a substring is the in operator, which performs a membership test and returns a Boolean (True or False). The find() method is also built in to standard Python strings: call it on the string object, like obj.find('search'), and it returns the character position of the first match, or -1 if the substring is not found.
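For example (a throwaway string, just to show the two calls):

s = 'large multi-line\nstring with a target\nsomewhere inside'

# membership test: returns True or False
print('target' in s)        # True

# find(): index of the first match, or -1 if the substring is absent
print(s.find('target'))     # 31
print(s.find('missing'))    # -1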
Certainly the second: I don't see any advantage in searching many small strings over searching one big string. You may skip some characters thanks to the shorter lines, but the split operation has its costs too (searching for \n, creating n different strings, creating the list), and the loop is done in Python.
The string __contains__ method is implemented in C and is therefore noticeably faster.
Also consider that the second method stops as soon as the first match is found, while the first one splits the whole string before even starting to search inside it.
This is rapidly proven with a simple benchmark:
import timeit

prepare = """
with open('bible.txt') as fh:
    text = fh.read()
"""

presplit_prepare = """
with open('bible.txt') as fh:
    text = fh.read()
lines = text.split('\\n')
"""

longsearch = """
'hello' in text
"""

splitsearch = """
for line in text.split('\\n'):
    if 'hello' in line:
        break
"""

presplitsearch = """
for line in lines:
    if 'hello' in line:
        break
"""

benchmark = timeit.Timer(longsearch, prepare)
print("IN on big string takes:", benchmark.timeit(1000), "seconds")
benchmark = timeit.Timer(splitsearch, prepare)
print("IN on split string takes:", benchmark.timeit(1000), "seconds")
benchmark = timeit.Timer(presplitsearch, presplit_prepare)
print("IN on pre-split string takes:", benchmark.timeit(1000), "seconds")
The result is:
IN on big string takes: 4.27126097679 seconds
IN on split string takes: 35.9622690678 seconds
IN on pre-split string takes: 11.815297842 seconds
The bible.txt file actually is the Bible; I found it here: http://patriot.net/~bmcgin/kjvpage.html (text version).
The second one is a lot faster; here is some measurement data:
def get_mother_of_all_strings():
    return "abcdefg\nhijklmnopqr\nstuvwxyz\naatargetbb"
first: 2.00
second: 0.26
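The exact timing harness is not shown; a minimal timeit sketch along these lines (the iteration count is a guess) would reproduce the comparison:

import timeit

def get_mother_of_all_strings():
    return "abcdefg\nhijklmnopqr\nstuvwxyz\naatargetbb"

# break instead of return so the snippet runs outside a function
split_search = """
for l in get_mother_of_all_strings().split('\\n'):
    if 'target' in l:
        break
"""

direct_search = "'target' in get_mother_of_all_strings()"

setup = "from __main__ import get_mother_of_all_strings"

print("first (split):  ", timeit.timeit(split_search, setup, number=1000000))
print("second (direct):", timeit.timeit(direct_search, setup, number=1000000))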
If you are only matching once to see whether the substring is in the string at all, the two methods are roughly comparable, but splitting adds overhead for the separate line-by-line searches, so searching the large string directly is a bit faster.
If you have to do multiple matches, I would tokenize the string, stuff the tokens into a set (or dictionary), and keep that in memory.
s = 'SOME REALLY LONG STRING'
tokens = set(s.split())          # tokenize once and keep the set in memory

def has_token(substring):
    return substring in tokens
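A small usage sketch of how the precomputed set pays off over many queries (the query list and the has_token helper above are illustrative). Note that this only matches whole whitespace-delimited tokens, not arbitrary substrings:

for query in ('REALLY', 'MISSING', 'STRING'):
    # each call is an average O(1) set lookup instead of rescanning s
    print(query, has_token(query))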