
Using regular expressions to exclude characters in a string search?

I'm working with a Python 2.7.2 script to find lists of words inside of a text file that I'm using as a master word list.

I am calling the script in a terminal window, inputting any number of regular expressions, and then running the script.

So, if I pass in the two regular expressions "^.....$" and ".*z", it will print every five-letter word that contains at least one "z".

What I am trying to do is add another regular expression to EXCLUDE a character from the strings. I would like to print out all words that have five letters, a "z", but -not- a "y".

Here is the code:

import re
import sys

def read_file_to_set(filename):
    words = None
    with open(filename) as f:
        words = [word.lower() for word in f.readlines()]
    return set(words)

def matches_all(word, regexes):
    for regex in regexes:
        if not regex.search(word):
            return False
    return True

if len(sys.argv) < 3:
    print "Needs a source dictionary and a series of regular expressions"
else:
    source = read_file_to_set(sys.argv[1])
    regexes = [re.compile(arg, re.IGNORECASE)
               for arg in sys.argv[2:]]
    for word in sorted(source):
        if matches_all(word.rstrip(), regexes):
            print word,

What modifiers can I put onto the regular expressions that I pass into the program to allow for me to exclude certain characters from the strings it prints?

If that isn't possible, what needs to be implemented in the code?

Zack Cruise asked Nov 12 '13 08:11



2 Answers

A character class that excludes certain characters is written like this (this one matches any character except a lowercase letter):

[^a-z]

So to match a string that does not contain "y", the regex is: ^[^y]*$

Character by character explanation:

^ means "beginning" if it comes at the start of the regex. Similarly, $ means "end" if it comes at the end. [abAB] matches any one of the characters listed inside the brackets, and a class can also contain ranges. For example, this matches any hex digit (upper or lower case): [a-fA-F0-9]

* means 0 or more of the previous expression. As the first character inside [], ^ has a different meaning: it means "not". So [^a-fA-F0-9] matches any non-hex character.
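The two meanings of ^ are easiest to see by running a class and its negation over the same text. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

sample = "0x1F-zz"  # hypothetical sample text

# Inside [], a-f, A-F and 0-9 are ranges: this finds every hex digit.
hex_chars = re.findall(r"[a-fA-F0-9]", sample)

# A leading ^ inside [] negates the class: this finds everything else.
non_hex_chars = re.findall(r"[^a-fA-F0-9]", sample)

print(hex_chars)
print(non_hex_chars)
```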

When you put a pattern between ^ and $, you force the regex to match the string exactly (nothing before or after the pattern). Combine all these facts:

^[^y]*$ matches a string that consists entirely of 0 or more characters that are not 'y'. (To do something more interesting, you could check for strings that contain no digits: ^[^0-9]*$)
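Applied to the script in the question, the exclusion simply becomes a third pattern. The sketch below mirrors the question's matches_all function on a small made-up word list; the words are an assumption for illustration and do not come from any real dictionary file.

```python
import re

def matches_all(word, regexes):
    # A word passes only if every compiled pattern finds a match.
    return all(regex.search(word) for regex in regexes)

# Five letters, at least one "z", and no "y" anywhere.
patterns = ["^.....$", ".*z", "^[^y]*$"]
regexes = [re.compile(p, re.IGNORECASE) for p in patterns]

# Hypothetical sample words standing in for the master word list.
words = ["dizzy", "zebra", "pizza", "crazy", "apple", "maize"]

selected = [w for w in words if matches_all(w, regexes)]
print(selected)
```

The anchors matter here: because the script uses search, an unanchored [^y]* can match the empty string and would therefore accept every word. On the command line the new pattern would probably need quoting so the shell passes it through unchanged.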

piojo answered Oct 03 '22 19:10



You can accomplish this with negative lookarounds. This isn't a task that regexes are particularly fast at, but it does work. To match any string that does not contain the substring foo, you can use:

>>> my_regex = re.compile(r'^((?!foo).)*$', flags = re.I)
>>> print my_regex.match(u'IMatchJustFine')
<_sre.SRE_Match object at 0x1034ea738>
>>> print my_regex.match(u'IMatchFooFine')
None

As others have pointed out, if you're only excluding a single character, then a simple negated character class such as [^y] will suffice. Longer and more complex negative matches would need to use this approach.
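For the single-character case in the question, lookaheads also let all three conditions live in one pattern. This is a sketch rather than a tuned solution; the combined pattern and the sample words are illustrative assumptions.

```python
import re

# (?=.*z)   lookahead: a "z" occurs somewhere in the word
# (?!.*y)   negative lookahead: no "y" occurs anywhere in the word
# .{5}$     the word is exactly five characters long
five_z_no_y = re.compile(r"^(?=.*z)(?!.*y).{5}$", re.IGNORECASE)

# Hypothetical sample words.
words = ["dizzy", "zebra", "pizza", "crazy", "apple", "maize"]
selected = [w for w in words if five_z_no_y.match(w)]
print(selected)

# The substring-excluding pattern shown earlier, checked the same way.
no_foo = re.compile(r"^((?!foo).)*$", re.IGNORECASE)
print(bool(no_foo.match("IMatchJustFine")))
print(bool(no_foo.match("IMatchFooFine")))
```

Keeping the conditions as separate patterns, as the question's script does, is usually easier to read; the single pattern is mainly useful when only one regex can be supplied.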

VooDooNOFX answered Oct 03 '22 17:10
