Using regular expressions to exclude characters in a string search?

Question

I'm working with a Python 2.7.2 script to find lists of words inside of a text file that I'm using as a master word list.

I am calling the script in a terminal window, inputting any number of regular expressions, and then running the script.

So, if I pass in the two regular expressions "^.....$" and ".*z" it will print every five letter word that contains at least one "z".

What I am trying to do is add another regular expression to EXCLUDE a character from the strings. I would like to print out all words that have five letters, a "z", but -not- a "y".

Here is the code:

import re
import sys

def read_file_to_set(filename):
    words = None
    with open(filename) as f:
        words = [word.lower() for word in f.readlines()]
    return set(words)

def matches_all(word, regexes):
    for regex in regexes:
        if not regex.search(word):
            return False
    return True

if len(sys.argv) < 3:
    print "Needs a source dictionary and a series of regular expressions"
else:
    source = read_file_to_set(sys.argv[1])
    regexes = [re.compile(arg, re.IGNORECASE)
               for arg in sys.argv[2:]]
    for word in sorted(source):
        if matches_all(word.rstrip(), regexes):
            print word,

What can I add to the regular expressions that I pass into the program so that it excludes words containing certain characters from what it prints?

If that isn't possible, what needs to be implemented in the code?

piojo · Accepted Answer

A set of characters that must not match is written like this (this example matches any character except a lowercase letter):

[^a-z]

So to match a string that does not contain "y", the regex is: ^[^y]*$

Character by character explanation:

^ means "beginning" if it comes at the start of the regex. Similarly, $ means "end" if it comes at the end.

[abAB] matches any single character listed inside the brackets. A range such as a-f is also allowed. For example, [a-fA-F0-9] matches any hex character (upper or lower case).

* means 0 or more of the previous expression.

As the first character inside [], ^ has a different meaning: it means "not". So [^a-fA-F0-9] matches any non-hex character.
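To make the difference between a character class and its negation concrete, here is a minimal sketch. The names hex_char and non_hex_char and the sample strings are made up for illustration, and the print calls are written so that they should run on both Python 2.7 and Python 3.

```python
import re

# A character class matches one character from the set.
hex_char = re.compile(r'[a-fA-F0-9]')
# A leading ^ inside the brackets negates the set.
non_hex_char = re.compile(r'[^a-fA-F0-9]')

# search() looks for the pattern anywhere in the string.
print(bool(hex_char.search('xyz7')))        # True: '7' is a hex character
print(bool(non_hex_char.search('c0ffee')))  # False: every character is hex
print(bool(non_hex_char.search('c0ffee!'))) # True: '!' is not hex
```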

When you put a pattern between ^ and $, you force the regex to match the string exactly (nothing before or after the pattern). Combine all these facts:

^[^y]*$ means a string that consists entirely of 0 or more characters that are not 'y'. (To do something more interesting, you could check for strings that contain no digits: ^[^0-9]*$)
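As a rough sketch of how this fits the script in the question, ^[^y]*$ can simply be passed as a third pattern. The word list below is made up for illustration and stands in for the master word list file. matches_all is a shortened equivalent of the question's function.

```python
import re

def matches_all(word, regexes):
    # A word passes only if every regex finds a match in it.
    return all(regex.search(word) for regex in regexes)

# Five letters, at least one "z", and no "y" anywhere.
patterns = ['^.....$', '.*z', '^[^y]*$']
regexes = [re.compile(p, re.IGNORECASE) for p in patterns]

# Hypothetical sample words, not taken from any real word list.
words = ['zebra', 'crazy', 'dizzy', 'pizza', 'apple', 'zoo']
result = [w for w in words if matches_all(w, regexes)]
print(result)  # ['zebra', 'pizza']
```

On the command line the same three patterns would be passed after the dictionary file name. They should be quoted so the shell does not interpret characters such as ^, $ and *.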

VooDooNOFX · Answer

You can accomplish this with negative lookarounds. This isn't a task that regexes are particularly fast at, but it does work. To match any string that does not contain the substring foo, you can use:

>>> my_regex = re.compile(r'^((?!foo).)*$', flags = re.I)
>>> print my_regex.match(u'IMatchJustFine')
<_sre.SRE_Match object at 0x1034ea738>
>>> print my_regex.match(u'IMatchFooFine')
None

As others have pointed out, if you're only excluding a single character, then a simple negated character class such as [^y] will suffice. Longer and more complex negative matches would need to use this approach.
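Here is a small sketch of the lookahead approach that should run on both Python 2.7 and Python 3. The second pattern, ^(?!.*foo), is not from the answer above. It is a shorter alternative that gives the same result for single-line strings, since . does not match a newline by default. The names no_foo and no_foo_short are made up for illustration.

```python
import re

# Pattern from the answer: at every position, check that "foo" does not start there.
no_foo = re.compile(r'^((?!foo).)*$', re.IGNORECASE)
# Alternative (an assumption, not from the answer): one lookahead at the start.
no_foo_short = re.compile(r'^(?!.*foo)', re.IGNORECASE)

for text in ['IMatchJustFine', 'IMatchFooFine']:
    # bool() turns the match object (or None) into True/False.
    print('%s %s %s' % (text, bool(no_foo.match(text)), bool(no_foo_short.match(text))))
```

Because the script in the question uses search() on each pattern separately, either form could be passed as one more command-line argument.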
