What's the most efficient way to find one of several substrings in Python?

2 Answers

I would assume a regex is better than checking for each substring individually, because conceptually a regular expression is modeled as a DFA: as the input is consumed, all alternatives are tested at the same time, so the input string is scanned only once.

So, here is an example:

    import re

    def work():
        to_find = re.compile("cat|fish|dog")
        search_str = "blah fish cat dog haha"
        match_obj = to_find.search(search_str)
        the_index = match_obj.start()  # produces 5, the index of fish
        which_word_matched = match_obj.group()  # "fish"
        # Note, if no match, match_obj is None
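For comparison, the per-substring approach that the opening paragraph argues against might look like the sketch below. The helper name `find_first` is made up for illustration; it calls `str.find` once per word, so the input is scanned once for every substring rather than once overall.

```python
def find_first(haystack, needles):
    """Return (index, needle) for the earliest occurrence of any needle, or None."""
    best = None
    for needle in needles:
        idx = haystack.find(needle)  # one full scan of haystack per needle
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, needle)
    return best

result = find_first("blah fish cat dog haha", ["cat", "fish", "dog"])
print(result)  # (5, 'fish')
```

Whether this is actually slower than the regex for a handful of short words is something worth measuring rather than assuming.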

UPDATE: Some care should be taken when combining words into a single pattern of alternatives. The following code builds a regex, escaping any regex special characters and sorting the words so that longer words get a chance to match before any shorter prefixes of the same word:

    def wordlist_to_regex(words):
        escaped = map(re.escape, words)
        combined = '|'.join(sorted(escaped, key=len, reverse=True))
        return re.compile(combined)

    >>> # r is the result of wordlist_to_regex(...); the word list itself is not shown here
    >>> r.search('smash atomic particles').span()
    (6, 10)
    >>> r.search('visit usenet:comp.lang.python today').span()
    (13, 29)
    >>> r.search(r'a north\south division').span()
    (2, 13)
    >>> r.search('012cat').span()
    (3, 6)
    >>> r.search('0123dog789cat').span()
    (4, 7)
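The session above relies on a compiled pattern `r` whose word list is not given. The sketch below is one way to reproduce the same spans. The `words` list is an assumption, chosen only because it is consistent with the results quoted above; it is not the list originally used.

```python
import re

def wordlist_to_regex(words):
    escaped = map(re.escape, words)
    combined = '|'.join(sorted(escaped, key=len, reverse=True))
    return re.compile(combined)

# Hypothetical word list: picked so the spans match the ones quoted above.
words = ['cat', 'dog', 'atom', 'comp.lang.python', 'north\\south']
r = wordlist_to_regex(words)

spans = [
    r.search('smash atomic particles').span(),
    r.search('visit usenet:comp.lang.python today').span(),
    r.search(r'a north\south division').span(),
    r.search('012cat').span(),
    r.search('0123dog789cat').span(),
]
print(spans)  # [(6, 10), (13, 29), (2, 13), (3, 6), (4, 7)]
```

Without `re.escape`, the dots in `comp.lang.python` would match any character and the backslash in `north\south` would be read as the start of the `\s` character class.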

END UPDATE

It should be noted that you will want to build the regex (i.e., call re.compile()) as rarely as possible. In the best case you know ahead of time what you are searching for (or you compute it once or infrequently) and save the result of re.compile() somewhere. My example is just a simple nonsense function so you can see the usage of the regex. There are some more regex docs here:

http://docs.python.org/library/re.html
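As a minimal sketch of the "compile once, reuse many times" advice, the pattern can live at module level so that every call shares it. The names `ANIMALS` and `first_animal` are made up for this illustration.

```python
import re

# Compiled once when the module is imported, then reused by every call.
ANIMALS = re.compile("cat|fish|dog")

def first_animal(text):
    """Return (index, word) for the first match, or None if nothing matches."""
    match_obj = ANIMALS.search(text)
    if match_obj is None:
        return None
    return match_obj.start(), match_obj.group()

print(first_animal("blah fish cat dog haha"))  # (5, 'fish')
print(first_animal("no pets here"))            # None
```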

Hope this helps.

UPDATE: I am unsure how Python implements regular expressions, but to answer Rax's question about whether re.compile() has limitations (for example, how many words you can "|" together to match at once) and how long compilation takes: neither seems to be an issue. I tried out the code below, which is good enough to convince me. (I could have improved it by adding timing and reporting results, and by putting the words into a set to ensure there are no duplicates, but both improvements seem like overkill.) The code ran basically instantaneously and convinced me that I can search for 2000 words (each 10 characters long) and that any of them will match appropriately. Here is the code:

    import random
    import re
    import string
    import sys

    def main(args):
        # Generate 2000 random 10-character alphanumeric words.
        words = []
        letters_and_digits = string.ascii_letters + string.digits
        for i in range(2000):
            chars = []
            for j in range(10):
                chars.append(random.choice(letters_and_digits))
            words.append("".join(chars))
        search_for = re.compile("|".join(words))
        first, middle, last = words[0], words[len(words) // 2], words[-1]
        # Put the last word first so the match must come from the end of the pattern.
        search_string = "%s, %s, %s" % (last, middle, first)
        match_obj = search_for.search(search_string)
        if match_obj is None:
            print("Ahhhg")
            return
        index = match_obj.start()
        which = match_obj.group()
        if index != 0:
            print("ahhhg")
            return
        if words[-1] != which:
            print("ahhg")
            return

        print("success!!! Generated 2000 random words, compiled re, and was able to perform matches.")

    if __name__ == "__main__":
        main(sys.argv)
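The two improvements mentioned in passing (timing, and a set to rule out duplicates) could be sketched roughly as follows. The use of `time.perf_counter` and a fixed random seed are choices made for this sketch, not part of the original test, and the timings will vary by machine.

```python
import random
import re
import string
import time

random.seed(0)  # fixed seed so runs are repeatable
alphabet = string.ascii_letters + string.digits

# Build 2000 distinct 10-character words; the set discards any duplicates.
unique_words = set()
while len(unique_words) < 2000:
    unique_words.add("".join(random.choice(alphabet) for _ in range(10)))
words = sorted(unique_words)

start = time.perf_counter()
search_for = re.compile("|".join(words))
compile_seconds = time.perf_counter() - start

# The last alternative in the pattern appears first in the search string.
search_string = ", ".join([words[-1], words[len(words) // 2], words[0]])
start = time.perf_counter()
match_obj = search_for.search(search_string)
search_seconds = time.perf_counter() - start

print("compile: %.6f s, search: %.6f s" % (compile_seconds, search_seconds))
print(match_obj.group() == words[-1], match_obj.start())
```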

UPDATE: It should be noted that the order of the alternatives ORed together in the regex matters. Have a look at the following test inspired by TZOTZIOY:

    >>> search_str = "01catdog"
    >>> test1 = re.compile("cat|catdog")
    >>> match1 = test1.search(search_str)
    >>> match1.group()
    'cat'
    >>> match1.start()
    2
    >>> test2 = re.compile("catdog|cat")  # reverse order
    >>> match2 = test2.search(search_str)
    >>> match2.group()
    'catdog'
    >>> match2.start()
    2

This suggests the order matters :-/. I am not sure what this means for Rax's application, but at least the behavior is known.
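To narrow down what "order matters" means here: the position in the input takes priority, and the order of the alternatives only decides between words that could start at the same index. A small sketch using the same test string:

```python
import re

# "dog" is listed first, but "cat" occurs earlier in the input, so "cat" wins.
pattern = re.compile("dog|cat")
m = pattern.search("01catdog")
print(m.group(), m.start())  # cat 2

# Both alternatives can start at index 2, so the one listed first is chosen.
print(re.compile("cat|catdog").search("01catdog").group())  # cat
print(re.compile("catdog|cat").search("01catdog").group())  # catdog
```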

UPDATE: I posted this question about the implementation of regular expressions in Python, which will hopefully give us some insight into the issues raised by this question.

132

answered Sep 23 '22 05:09

Tom

    import re

    subs = ['cat', 'fish', 'dog']
    sentences = ['0123dog789cat']

    # Compile the alternation once, outside the search function.
    pattern = re.compile("|".join(subs))

    def search():
        for sentence in sentences:
            result = pattern.search(sentence)
            if result is not None:
                return (result.group(), result.span()[0])

    # search() returns ('dog', 4)

answered Sep 25 '22 05:09

Unknown


Tags:

python

string

substring

regex

Roee Adler
