Python regex multiple search

Tags: python, regex

I need to search a string for multiple words.

import re

words = [{'word':'test1', 'case':False}, {'word':'test2', 'case':False}]

status = "test1 test2"

for w in words:
    if w['case']:
        r = re.compile(r"\s#?%s" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile(r"\s#?%s" % w['word'], re.MULTILINE)
    if r.search(status):
        print("Found word %s" % w['word'])

For some reason, this will only ever find "test2" and never "test1". Why is this?

I know I could combine the words into a single |-separated pattern, but there could be hundreds of words, which is why I am using a for loop.

Hanpan asked May 28 '11 18:05



2 Answers

There is no space before test1 in status, while your generated regular expressions require there to be a space.

You can modify the test to match either after a space or at the beginning of a line:

for w in words:
    if w['case']:
        r = re.compile(r"(^|\s)#?%s" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile(r"(^|\s)#?%s" % w['word'], re.MULTILINE)
    if r.search(status):
        print("Found word %s" % w['word'])
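
To see the difference concretely, here is a minimal sketch that runs both prefixes against the status string from the question. It assumes Python 3, the helper name found_words is made up for illustration, and re.escape is added on the assumption that the words should be matched literally.

```python
import re

status = "test1 test2"
words = ["test1", "test2"]

def found_words(prefix, text, words):
    """Return the words whose prefixed pattern matches somewhere in text."""
    hits = []
    for word in words:
        # re.escape keeps characters such as '.' or '+' in a word literal.
        pattern = re.compile(prefix + "#?" + re.escape(word), re.MULTILINE)
        if pattern.search(text):
            hits.append(word)
    return hits

# Original prefix: a whitespace character must precede the word.
original = found_words(r"\s", status, words)
print(original)   # ['test2']

# Modified prefix: start of a line or a whitespace character.
fixed = found_words(r"(^|\s)", status, words)
print(fixed)      # ['test1', 'test2']
```

With re.MULTILINE set, ^ also matches right after each newline, so a word at the start of any line in a multi-line status is found as well.
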
Martijn Pieters answered Oct 27 '22 15:10


As Martijn pointed out, there's no space before test1. Your code also doesn't handle the case where the word is only the start of a longer one: it would report test2blabla as an instance of test2, and I'm not sure that is what you want.

I suggest using the word-boundary assertion \b:

for w in words:
    if w['case']:
        r = re.compile(r"\b%s\b" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile(r"\b%s\b" % w['word'], re.MULTILINE)
    if r.search(status):
        print("Found word %s" % w['word'])
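
As a quick check of the test2blabla case, the following sketch compares a prefix-only pattern with the \b version. The sample text is made up for illustration and the code assumes Python 3.

```python
import re

text = "test1 test2blabla"

# Without a trailing boundary, test2 matches as a prefix of test2blabla.
prefix_hit = bool(re.search(r"(^|\s)#?test2", text))
print(prefix_hit)      # True

# With \b on both sides, test2 must stand alone, so it is not found here.
bounded_hit = bool(re.search(r"\btest2\b", text))
print(bounded_hit)     # False

# test1 is a standalone word, so the bounded pattern still finds it.
standalone_hit = bool(re.search(r"\btest1\b", text))
print(standalone_hit)  # True
```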

EDIT:

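I should've pointed out that if you really want to allow only the (whitespace)word or (whitespace)#word format, you cannot use \b on its own, because \b also matches after other non-word characters such as a hyphen or a period.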
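
One possible way to get that stricter behaviour, sketched here as an assumption rather than something from the question, is a negative lookbehind in place of the leading \b. The helper name make_matcher and the sample string are hypothetical, and the trailing \b assumes each word ends in a letter, digit or underscore.

```python
import re

def make_matcher(word, ignore_case=False):
    # (?<!\S): the match must start at the beginning of the string or
    # right after whitespace. #? allows an optional leading '#'.
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"(?<!\S)#?" + re.escape(word) + r"\b", flags)

status = "test1 #test2 x-test3 test4blabla"
results = {
    word: bool(make_matcher(word).search(status))
    for word in ["test1", "test2", "test3", "test4"]
}
print(results)
# {'test1': True, 'test2': True, 'test3': False, 'test4': False}
```

Here x-test3 is rejected because the word follows a hyphen rather than whitespace, and test4blabla is rejected by the trailing \b.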

Norbert P. answered Oct 27 '22 16:10