I'm looking to find words in a string that match a specific pattern. Problem is, if the words are part of an email address, they should be ignored.
To simplify, the pattern of the "proper words" is \w+\.\w+
- one or more word characters, a literal period, and another run of word characters.
An example of a sentence that causes the problem is a.a b.b:c.c [email protected].
The goal is to match only [a.a, b.b, c.c]. With most regexes I build, e.e is returned as well (because I use some word boundary match).
For example:
>>> re.findall(r"(?:^|\s|\W)(?<!@)(\w+\.\w+)(?!@)\b", "a.a b.b:c.c [email protected]")
['a.a', 'b.b', 'c.c', 'e.e']
How can I match only among words that do not contain "@"?
To match any character except a list of excluded characters, put the excluded characters between [^ and ] . The caret ^ must immediately follow the [ or else it stands for just itself.
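As a rough illustration of the negated class, here is a minimal sketch. The sample string and the choice of "@" and whitespace as the excluded characters are assumptions made for this example only.

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "a.a b.b user@host"

# [^@\s]+ matches a run of characters that are neither "@" nor whitespace.
tokens = re.findall(r"[^@\s]+", sample)
print(tokens)

# When the caret is not the first character in the class, it is a literal "^".
literal_caret = re.findall(r"[a^]", "a^b")
print(literal_caret)
```

Note that a negated class alone only splits the email at the "@"; it does not tell you which pieces belonged to an email.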
[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Captures the literal text "a-z0-9", because ranges only have meaning inside [] .
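The difference between the two is easy to check in Python. This is a small sketch with made-up input strings:

```python
import re

# Character class: each match is ONE character from a-z or 0-9.
class_matches = re.findall(r"[a-z0-9]", "x7!")
print(class_matches)

# Capturing group: the hyphens are literal here, so the pattern only
# matches the exact text "a-z0-9".
group_matches = re.findall(r"(a-z0-9)", "x7 a-z0-9")
print(group_matches)
```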
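(?i) makes the regex case insensitive. In some regex flavors (?c) makes the regex case sensitive; Python's re module does not support (?c), and matching there is case sensitive by default.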
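In Python specifically, the inline flag and the re.IGNORECASE argument should behave the same way. A short sketch with an invented sample string:

```python
import re

text = "Hello HELLO hello"  # hypothetical sample

# Inline flag at the start of the pattern.
inline = re.findall(r"(?i)hello", text)

# Equivalent flag passed as an argument.
by_flag = re.findall(r"hello", text, flags=re.IGNORECASE)

# No flag: Python matches case sensitively by default.
default = re.findall(r"hello", text)

print(inline, by_flag, default)
```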
To match a character that has special meaning in regex, prefix it with a backslash ( \ ) to form an escape sequence. E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . To match "\" (backslash) itself, use regex \\ .
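A quick way to try these escapes in Python is sketched below. Raw strings (r"...") are used so that Python itself does not interpret the backslashes; the inputs are made up for the example.

```python
import re

dot = re.findall(r"\.", "a.b")          # literal period
plus = re.findall(r"\+", "1+1")         # literal plus sign
paren = re.findall(r"\(", "f(x)")       # literal opening parenthesis
backslash = re.findall(r"\\", "a\\b")   # literal backslash

print(dot, plus, paren, backslash)

# re.escape adds the backslashes for you when the text comes from elsewhere.
escaped = re.escape("a.b+c")
print(escaped)
```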
I would definitely clean the input up first and simplify the regex.
First, split the string on colons and whitespace:
words = re.split(r':|\s', "a.a b.b:c.c [email protected]")
Then filter out the words that have an @ in them:
# keep a word only if the pattern matches, i.e. it contains no @
words = [word for word in words if re.search(r'^((?!@).)*$', word)]
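Putting the split and the filter together, a minimal sketch might look like the following. The function name proper_words and the stand-in email d.d@e.e are assumptions for illustration, and a plain "@" membership test replaces the lookahead regex because it does the same job here.

```python
import re

# The "proper word" pattern from the question.
PROPER_WORD = re.compile(r"\w+\.\w+")

def proper_words(text):
    """Return tokens that match \\w+\\.\\w+ and are not part of an email."""
    # Split on runs of whitespace and colons.
    tokens = re.split(r"[:\s]+", text)
    # Drop tokens containing "@", then keep only full pattern matches.
    return [t for t in tokens if "@" not in t and PROPER_WORD.fullmatch(t)]

# d.d@e.e is a hypothetical stand-in for the email in the question.
sample = "a.a b.b:c.c d.d@e.e"
result = proper_words(sample)
print(result)
```

Using fullmatch also discards empty tokens and anything that is not a "proper word", so no separate cleanup step is needed.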