
Python: check if any word in a list of words matches any pattern in a list of regular expression patterns

Tags: python, regex

I have a long list of words and regular expression patterns in a .txt file, which I read in like this:

with open(fileName, "r") as f1:
    pattern_list = f1.read().split('\n')

For illustration, the first seven look like this:

print(pattern_list[:7])
# ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

I want to know whenever a word from an input string matches any of the words/patterns in pattern_list. The code below sort of works, but I see two problems:

  1. First, it seems pretty inefficient to re.compile() every item in my pattern_list every time I inspect a new string_input. But when I tried to store the re.compile(raw_str) objects in a list (so I could reuse the already compiled regexes with something like if w in regex_compile_list:), it didn't work right.
  2. Second, it sometimes doesn't work as I expect - notice how
    • abuse* matched with abusive
    • abusi* matched with abused and abuse
    • ache* matched with aching

What am I doing wrong, and how can I be more efficient? Thanks in advance for your patience with a noob, and thanks for any insight!

import re

string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."
for raw_str in pattern_list:
    pat = re.compile(raw_str)  # compiled again for every new string_input
    for w in string_input.split():
        if pat.match(w):  # match() anchors only at the start of w
            print("matched:", raw_str, "with:", w)
#matched: abandon* with: abandoned
#matched: abandon* with: abandon
#matched: abuse* with: abused
#matched: abuse* with: abusive,
#matched: abuse* with: abuse
#matched: abusi* with: abused
#matched: abusi* with: abusive,
#matched: abusi* with: abuse
#matched: ache* with: aching
#matched: aching with: aching
#matched: advers* with: adversarial,
#matched: afraid with: afraid
#matched: aggress* with: aggressive
#matched: aggress* with: aggression.
asked Jun 12 '13 14:06 by CJH




2 Answers

For matching shell-style wildcards you could (ab)use the fnmatch module.

As fnmatch is primarily designed for filename comparison, whether the test is case-sensitive depends on your operating system. So you'll have to normalize both the text and the pattern (here, I use lower() for that purpose).

>>> import fnmatch

>>> pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']
>>> string_input = "People who have been abandoned or abused will often be afraid of adversarial, abusive, or aggressive behavior. They are aching to abandon the abuse and aggression."


>>> words = string_input.lower().split()
>>> for pattern in pattern_list:
...     matches = fnmatch.filter(words, pattern.lower())
...     if matches:
...         print(pattern, "match", matches)
...

Producing:

abandon* match ['abandoned', 'abandon']
abuse* match ['abused', 'abuse']
abusi* match ['abusive,']
aching match ['aching']
advers* match ['adversarial,']
afraid match ['afraid']
aggress* match ['aggressive', 'aggression.']
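
On the efficiency half of the question, one option is to convert each wildcard to a regex once, up front, and reuse the compiled objects for every input string. The sketch below is not part of the original answer. It assumes the entries are shell-style wildcards (as this answer treats them), uses fnmatch.translate() for the conversion, and introduces a hypothetical helper, find_matches(), that also strips surrounding punctuation so that abusive, is tested as abusive.

```python
import fnmatch
import re
import string

pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

# Translate each wildcard to a regex and compile it exactly once.
# The raw text is kept alongside so matches can be reported by pattern.
compiled = [
    (raw, re.compile(fnmatch.translate(raw), re.IGNORECASE))
    for raw in pattern_list
]


def find_matches(text):
    """Return (pattern, word) pairs for each word that matches a pattern.

    Hypothetical helper for illustration; the name is not from the original post.
    """
    # Strip leading/trailing punctuation so 'abusive,' is tested as 'abusive'.
    words = [w.strip(string.punctuation) for w in text.split()]
    # translate() anchors the regex at the end, so match() tests the whole word.
    return [(raw, w) for raw, rx in compiled for w in words if rx.match(w)]
```

Calling find_matches(string_input) repeatedly reuses the same compiled list, so nothing is recompiled per input. Using re.IGNORECASE here is one way to get case-insensitive matching without calling lower() on the text.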
answered Oct 23 '22 00:10 by Sylvain Leroux


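In a regular expression, * repeats only the character before it, so abandon* means "abando followed by zero or more n". It matches abandonnnnnnnnnnnnnnnnnnnnnnn in full, but only the abandon prefix of abandonasfdsafdasf, and re.match() reports a hit as soon as a prefix matches. That is also why abuse* matched abusive: abus followed by zero e characters is enough. To say "abandon followed by anything", you want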

abandon.*

instead.
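
A minimal sketch of that fix, under some assumptions not stated in the answer: every * in the word list is meant as a wildcard, each word should match a pattern in full, and words are extracted with \w+ so trailing punctuation is ignored. The helper names wildcard_to_regex() and matching_words() are hypothetical. All patterns are joined into one alternation and compiled a single time.

```python
import re

# '*' binds to the preceding character only:
assert re.match(r'abuse*', 'abusive')           # 'abus' plus zero 'e' matches a prefix
assert not re.fullmatch(r'abuse.*', 'abusive')  # 'abuse' must appear literally
assert re.fullmatch(r'abuse.*', 'abused')


def wildcard_to_regex(raw):
    """Escape the literal parts of a word-list entry and turn each '*' into '.*'."""
    return '.*'.join(re.escape(part) for part in raw.split('*'))


pattern_list = ['abandon*', 'abuse*', 'abusi*', 'aching', 'advers*', 'afraid', 'aggress*']

# One alternation of all patterns, compiled once and reused for every input.
combined = re.compile(
    '|'.join('(?:%s)' % wildcard_to_regex(p) for p in pattern_list)
)


def matching_words(text):
    """Return the words in text that fully match at least one pattern."""
    words = re.findall(r'\w+', text)  # drops punctuation such as ',' and '.'
    return [w for w in words if combined.fullmatch(w)]
```

The combined pattern answers "does this word match anything?" in one call per word. If you need to know which pattern matched, keep a list of separately compiled patterns instead.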

answered Oct 23 '22 01:10 by Joe Frambach