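To replace or remove characters that don't match a regex, call the replace() method on the string, passing it a regular expression with a negated character class (a caret ^ as the first character inside the brackets), e.g. /[^a-z]+/. In JavaScript, add the g flag (/[^a-z]+/g) to replace every occurrence rather than only the first.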
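For readers working in Python rather than JavaScript, roughly the same idea can be sketched with re.sub(). The helper name keep_only and its allowed parameter are made up for this illustration and are not from the original answer:

```python
import re

def keep_only(text: str, allowed: str = "a-z") -> str:
    """Remove every run of characters that falls outside the allowed class."""
    # A leading ^ inside [...] negates the class, so the pattern matches
    # whatever is NOT in `allowed` (hypothetical parameter name).
    pattern = re.compile(f"[^{allowed}]+")
    # re.sub replaces every occurrence, so no "global" flag is needed here.
    return pattern.sub("", text)

print(keep_only("Hello, World! 123"))  # elloorld
```

Note that `allowed` is interpolated into the character class as-is, so this sketch assumes it is a trusted range such as "a-z" or "0-9".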
(?i) makes the regex case insensitive. (?s), for "single line mode", makes the dot match all characters, including line breaks.
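A quick way to see both inline flags at work in Python's re module (the sample text is made up for the demonstration):

```python
import re

text = "Hello\nWORLD"

# (?i) is the inline form of re.IGNORECASE.
case_sensitive = re.search(r"world", text)
case_insensitive = re.search(r"(?i)world", text)

# (?s) is the inline form of re.DOTALL: the dot also matches "\n".
dot_default = re.search(r"Hello.WORLD", text)
dot_all = re.search(r"(?s)Hello.WORLD", text)

print(case_sensitive, case_insensitive.group())
print(dot_default, dot_all.group() == text)
```

Only the searches that carry the inline flag find a match; the other two return None.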
Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing source code. There are better tools to create tokenized outputs. I would avoid parsing a URL's path and query parameters with regex.
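For the URL case, a dedicated parser is usually less error-prone. A minimal sketch with Python's standard urllib.parse, using a made-up example URL:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL, used only for illustration.
url = "https://example.com/docs/regex?lang=python&tag=a&tag=b"

parts = urlsplit(url)

# Split the path into its non-empty segments.
path_segments = [segment for segment in parts.path.split("/") if segment]

# parse_qs maps each key to a list, so repeated keys are preserved.
query = parse_qs(parts.query)

print(path_segments)  # ['docs', 'regex']
print(query)          # {'lang': ['python'], 'tag': ['a', 'b']}
```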
POSIX also defines two flavors of regular expressions: "basic" regular expressions (BRE) and "extended" regular expressions (ERE).
Leverage negative lookahead:
>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')
This RE is a contradiction in terms and therefore will never match anything.
NOTE:
In Python, re.match() implicitly adds a beginning-of-string anchor (\A) to the start of the regular expression. This anchor is important for performance: without it, the entire string will be scanned. Those not using Python will want to add the anchor explicitly:
\A(?!x)x
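The explicitly anchored form can be checked with re.search(), which does not anchor on its own. The sample strings below are arbitrary:

```python
import re

# \A pins the attempt to the start of the string; (?!x)x can never succeed.
never = re.compile(r"\A(?!x)x")

samples = ["", "x", "y", "xx", "yx", "line one\nx"]

# search() would normally try every position, but \A only matches at index 0.
results = [never.search(s) for s in samples]

print(all(r is None for r in results))  # True
```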
This is actually quite simple, although it depends on the implementation / flags*:
$a
Will match a character a after the end of the string. Good luck.
WARNING:
This expression is expensive -- it will scan the entire line, find the end-of-line anchor, and only then not find the a and return a negative match. (See comment below for more detail.)
* Originally I did not give much thought to multiline-mode regexps, where $ also matches the end of a line. In fact, it would match the empty string right before the newline, so an ordinary character like a can never appear after $.
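Since the behavior depends on the implementation and flags, here is a small check for Python's re module only, with and without re.MULTILINE. The sample strings are arbitrary, and other engines may behave differently:

```python
import re

end_then_a = re.compile(r"$a")
end_then_a_multiline = re.compile(r"$a", re.MULTILINE)

samples = ["", "a", "ba", "a\na", "\na", "a\n"]

# In Python, $ matches at the very end, before a trailing newline, and
# (in MULTILINE mode) before every newline. The character after $ is
# therefore either "\n" or nothing at all, never an "a".
default_hits = [s for s in samples if end_then_a.search(s)]
multiline_hits = [s for s in samples if end_then_a_multiline.search(s)]

print(default_hits, multiline_hits)  # [] []
```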
One that was missed:
^\b$
It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.
Lookaround:
(?=a)b
For regex newbies: The positive lookahead (?=a) makes sure that the next character is a, but doesn't change the search location (or include the 'a' in the matched string). Now that the next character is confirmed to be a, the remaining part of the regex (b) matches only if the next character is b. Thus, this regex matches only if a character is both a and b at the same time.
a\bc, where \b is a zero-width expression that matches a word boundary.
A word boundary can't occur in the middle of a word, which is exactly where this pattern forces it to be.
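None of these claims is hard to sanity-check by brute force. The sketch below runs each candidate pattern from the answers above against every string up to four characters long over a small alphabet, in Python's re module, with and without re.MULTILINE. The helper name find_counterexample and the choice of alphabet are assumptions made for this illustration, and an exhaustive search over short strings is evidence rather than a proof:

```python
import re
from itertools import product

# Candidate never-matching patterns collected from the answers above.
candidates = [r"(?!x)x", r"$a", r"^\b$", r"(?=a)b", r"a\bc"]

# Small alphabet: letters the patterns mention, a space, and a newline.
alphabet = "abcx \n"

def find_counterexample(pattern, max_len=4, flags=0):
    """Return the first short string the pattern matches, or None."""
    compiled = re.compile(pattern, flags)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if compiled.search(s):
                return s
    return None

for pattern in candidates:
    for flags in (0, re.MULTILINE):
        # Each call is expected to report None (no counterexample found).
        print(repr(pattern), flags, find_counterexample(pattern, flags=flags))
```

As a control, find_counterexample(r"a") returns "a", which shows the search does find matches when they exist.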