I'm working with a Python 2.7.2 script to find lists of words inside of a text file that I'm using as a master word list.
I am calling the script in a terminal window, inputting any number of regular expressions, and then running the script.
So, if I pass in the two regular expressions "^.....$" and ".*z", it prints every five-letter word that contains at least one "z".
What I am trying to do is add another regular expression to EXCLUDE a character from the strings. I would like to print all words that have five letters and a "z", but not a "y".
Here is the code:
import re
import sys

def read_file_to_set(filename):
    words = None
    with open(filename) as f:
        words = [word.lower() for word in f.readlines()]
    return set(words)

def matches_all(word, regexes):
    for regex in regexes:
        if not regex.search(word):
            return False
    return True

if len(sys.argv) < 3:
    print "Needs a source dictionary and a series of regular expressions"
else:
    source = read_file_to_set(sys.argv[1])
    regexes = [re.compile(arg, re.IGNORECASE)
               for arg in sys.argv[2:]]
    for word in sorted(source):
        if matches_all(word.rstrip(), regexes):
            print word,
What modifiers can I put onto the regular expressions that I pass into the program to allow for me to exclude certain characters from the strings it prints?
If that isn't possible, what needs to be implemented in the code?
Specifying a character that doesn't match is done like this (this matches anything except a lowercase letter):
[^a-z]
So to match a string that does not contain "y", the regex is: ^[^y]*$
Character by character explanation:
^ means "beginning" if it comes at the start of the regex. Similarly, $ means "end" if it comes at the end.
[abAB] matches any one of the characters listed inside the brackets; a range can also be given. For example, to match any hex character (upper or lower case): [a-fA-F0-9]
* means 0 or more of the previous expression.
As the first character inside [], ^ has a different meaning: it means "not". So [^a-fA-F0-9] matches any non-hex character.
When you put a pattern between ^ and $, you force the regex to match the whole string (nothing before or after the pattern). Combining all these facts, ^[^y]*$ means a string that consists entirely of 0 or more characters that are not 'y'. (To do something more interesting, you could check for strings that contain no digits: ^[^0-9]*$)
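As a rough sketch of how this fits the script in the question, the exclusion pattern can simply be passed as a third regular expression. The word list below is made up for illustration, and matches_all mirrors the function from the question (condensed with all()). The snippet is written so that it should run on both Python 2.7 and Python 3.

```python
import re

def matches_all(word, regexes):
    """Return True only if every compiled regex finds a match in word."""
    return all(regex.search(word) for regex in regexes)

# Hypothetical sample words standing in for the master word list.
words = ["crazy", "dizzy", "zebra", "pizza", "maize", "zoo", "hello"]

patterns = [
    r"^.....$",   # exactly five characters
    r"z",         # contains at least one "z"
    r"^[^y]*$",   # contains no "y" anywhere in the word
]
regexes = [re.compile(p, re.IGNORECASE) for p in patterns]

# Keep only the words that satisfy all three patterns.
result = [w for w in sorted(words) if matches_all(w, regexes)]
print(result)
```

With the original script, the same effect should come from adding '^[^y]*$' as one more argument after the dictionary file. Quoting it on the command line keeps the shell from interpreting the special characters.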
You can accomplish this with negative lookarounds. This isn't a task that regexes are particularly fast at, but it does work. To match any string that does not contain the substring foo, you can use:
>>> my_regex = re.compile(r'^((?!foo).)*$', flags = re.I)
>>> print my_regex.match(u'IMatchJustFine')
<_sre.SRE_Match object at 0x1034ea738>
>>> print my_regex.match(u'IMatchFooFine')
None
As others have pointed out, if you're only excluding a single character, then a simple negated character class such as [^y] will suffice. Longer and more complex negative matches would need to use the lookaround approach.
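To relate the two approaches back to word filtering, here is a small sketch that applies the same ^((?!...).)*$ shape to a made-up word list, next to the single-character case written as a negated class. The helper name excludes and the sample words are assumptions for illustration only, not part of the original script.

```python
import re

def excludes(substring):
    """Build a regex matching strings that do not contain substring.

    re.escape keeps any regex metacharacters in substring literal.
    """
    return re.compile(r'^((?!%s).)*$' % re.escape(substring), re.IGNORECASE)

# Hypothetical sample words.
words = ["pizza", "pizzeria", "zebra", "Fizzy"]

# Excluding a multi-character substring needs the lookahead form.
no_zz = excludes("zz")
# Excluding a single character only needs a negated character class.
no_y = re.compile(r'^[^y]*$', re.IGNORECASE)

without_zz = [w for w in words if no_zz.search(w)]
without_y = [w for w in words if no_y.search(w)]
print(without_zz)
print(without_y)
```

Either compiled pattern could be added to the list of regexes the script already checks, since both are ordinary patterns that work with search().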