Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python string search efficiency

For very large strings (spanning multiple lines) is it faster to use Python's built-in string search or to split the large string (perhaps on \n) and iteratively search the smaller strings?

E.g., for very large strings:

for l in get_mother_of_all_strings().split('\n'):
 if 'target' in l:
   return True
return False

or

return 'target' in get_mother_of_all_strings()
like image 393
sholsapp Avatar asked Aug 05 '11 22:08

sholsapp


People also ask

Is regex faster than string search Python?

Summary: find and in depend on string length and location of pattern in the string while regex is somehow string-length independent and faster for very long strings with the pattern at the end. Save this answer.

Is regex faster than find?

Is regex faster than find? Just so you know, the main thing slowing down the regexp here is having to compile the pattern every time. If the pattern is precompiled and used over and over again, matching is only about 25% slower than using find.

How do you check if a string is in a bigger string python?

The easiest way to check if a Python string contains a substring is to use the in operator. The in operator is used to check data structures for membership in Python. It returns a Boolean (either True or False ).

How do you search a string in Python?

String find() in Python The find(query) method is built-in to standard python. Just call the method on the string object to search for a string, like so: obj. find(“search”). The find() method searches for a query string and returns the character position if found.


3 Answers

Probably Certainly the second, I don't see any difference in doing a search in a big string or many in small strings. You may skip some chars thanks to the shorter lines, but the split operation has its costs too (searching for \n, creating n different strings, creating the list) and the loop is done in python.

The string __contain__ method is implemented in C and so noticeably faster.

Also consider that the second method aborts as soon as the first match is found, but the first one splits all the string before even starting to search inside it.

This is rapidly proven with a simple benchmark:

import timeit

prepare = """
with open('bible.txt') as fh:
    text = fh.read()
"""

presplit_prepare = """
with open('bible.txt') as fh:
    text = fh.read()
lines = text.split('\\n')
"""

longsearch = """
'hello' in text
"""

splitsearch = """
for line in text.split('\\n'):
    if 'hello' in line:
        break
"""

presplitsearch = """
for line in lines:
    if 'hello' in line:
        break
"""


benchmark = timeit.Timer(longsearch, prepare)
print "IN on big string takes:", benchmark.timeit(1000), "seconds"

benchmark = timeit.Timer(splitsearch, prepare)
print "IN on splitted string takes:", benchmark.timeit(1000), "seconds"

benchmark = timeit.Timer(presplitsearch, presplit_prepare)
print "IN on pre-splitted string takes:", benchmark.timeit(1000), "seconds"

The result is:

IN on big string takes: 4.27126097679 seconds
IN on splitted string takes: 35.9622690678 seconds
IN on pre-splitted string takes: 11.815297842 seconds

The bible.txt file actually is the bible, I found it here: http://patriot.net/~bmcgin/kjvpage.html (text version)

like image 64
GaretJax Avatar answered Sep 28 '22 19:09

GaretJax


The second one is a lot faster, here are some measurement data:

def get_mother_of_all_strings():
    return "abcdefg\nhijklmnopqr\nstuvwxyz\naatargetbb"

first: 2.00
second: 0.26
like image 42
Karoly Horvath Avatar answered Sep 28 '22 19:09

Karoly Horvath


If you are only matching once to see if the substring is in the string at all, then both methods are about the same, and you get more overhead for splitting it into separate line by line searches; so the large string search is a bit faster.

If you have to do multiple matches, then I would tokenize the string and stuff them into a dictionary or set and store it in memory.

s = 'SOME REALLY LONG STRING'
tokens = set(s.split())
return substring in tokens
like image 32
jontsai Avatar answered Sep 28 '22 17:09

jontsai