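To replace or remove characters that don't match a pattern, call the replace() method on the string and pass it a regular expression whose character class starts with the caret ^ symbol, e.g. /[^a-z]+/. Inside square brackets, a leading ^ negates the class, so this example matches runs of characters that are not lowercase letters.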
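The sentence above describes JavaScript's replace(). As a rough Python equivalent, the same negated character class can be passed to re.sub. The function name keep_only_lowercase and the sample input are made up for illustration.

```python
import re

def keep_only_lowercase(text: str) -> str:
    # [^a-z]+ matches one or more characters that are NOT lowercase ASCII letters;
    # replacing each such run with "" removes it.
    return re.sub(r"[^a-z]+", "", text)

print(keep_only_lowercase("Hello, World 42!"))  # → elloorld
```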
(?i) makes the regex case insensitive. (?s), for "single line mode", makes the dot match all characters, including line breaks.
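A small sketch of the two inline flags (?i) and (?s) in Python's re module. The sample text is an assumption chosen so that each flag changes the outcome.

```python
import re

text = "Hello\nWORLD"

# Without (?i), "world" does not match the upper-case "WORLD".
case_sensitive = re.search(r"world", text)
case_insensitive = re.search(r"(?i)world", text)

# Without (?s), the dot refuses to match the line break between the words.
dot_default = re.search(r"Hello.WORLD", text)
dot_single_line = re.search(r"(?s)Hello.WORLD", text)

print(case_sensitive)                    # → None
print(case_insensitive.group())          # → WORLD
print(dot_default)                       # → None
print(dot_single_line.group() == text)   # → True
```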
Regex isn't suited to parse HTML because HTML isn't a regular language. Regex probably won't be the tool to reach for when parsing source code. There are better tools to create tokenized outputs. I would avoid parsing a URL's path and query parameters with regex.
POSIX also defines two flavors of regular expressions: "basic" regular expressions (BRE) and "extended" regular expressions (ERE).
Leverage negative lookahead:
>>> import re
>>> x=r'(?!x)x'
>>> r=re.compile(x)
>>> r.match('')
>>> r.match('x')
>>> r.match('y')
This regex is a contradiction in terms and therefore will never match anything; each call above returns None, which the interactive prompt does not display.
NOTE:
In Python, re.match() implicitly adds a beginning-of-string anchor (\A) to the start of the regular expression.  This anchor is important for performance: without it, the entire string will be scanned.  Those not using Python will want to add the anchor explicitly:
\A(?!x)x
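To check the explicitly anchored form, one can run it through re.search, which (unlike re.match) does not anchor at the start on its own. The helper name never_matches and the sample strings are hypothetical, picked only to exercise a few cases.

```python
import re

# Explicit \A anchor, so the engine gives up after trying position 0.
NEVER = re.compile(r"\A(?!x)x")

def never_matches(samples):
    # True when the pattern fails on every sample string.
    return all(NEVER.search(s) is None for s in samples)

print(never_matches(["", "x", "y", "xx", "abc\nx"]))  # → True
```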
This is actually quite simple, although it depends on the implementation / flags*:
$a
This would have to match a character a after the end of the string, which cannot exist. Good luck.
WARNING:
This expression is expensive: it will scan the entire line, find the end-of-line anchor, and only then fail to find the a and report no match.
* Originally I did not give much thought on multiline-mode regexp, where $ also matches the end of a line. In fact, it would match the empty string right before the newline, so an ordinary character like a can never appear after $.
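The footnote's point about multiline mode can be checked with a short sketch. This only exercises Python's re module; other engines may treat $ differently, and the sample strings are assumptions.

```python
import re

# Includes strings with trailing newlines and an "a" right after a line break.
SAMPLES = ["", "a", "aa", "a\n", "line\na", "x\na\n"]

def any_match(pattern, flags=0):
    # True if the pattern is found anywhere in at least one sample.
    compiled = re.compile(pattern, flags)
    return any(compiled.search(s) for s in SAMPLES)

print(any_match(r"$a"))                # → False
print(any_match(r"$a", re.MULTILINE))  # → False
```

In both modes $ matches just before a newline or at the very end, so the character that follows is either the newline itself or nothing at all, never a.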
One that was missed:
^\b$
It can't match because the empty string doesn't contain a word boundary. Tested in Python 2.5.
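A sketch for re-running this check on a current Python 3 (the answer only reports Python 2.5). Adding re.MULTILINE and the particular samples are my assumptions, meant to also cover empty lines inside a longer string.

```python
import re

# The raw string matters: in a normal Python string literal "\b" is a backspace
# character, not the regex word-boundary assertion.
WORD_BOUNDARY_ONLY = re.compile(r"^\b$", re.MULTILINE)

samples = ["", "a", "\n", "a\n\nb", " "]
hits = [s for s in samples if WORD_BOUNDARY_ONLY.search(s)]

print(hits)  # → []
```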
Lookaround:
(?=a)b
For regex newbies: The positive look ahead (?=a) makes sure that the next character is a, but doesn't change the search location (or include the 'a' in the matched string). Now that next character is confirmed to be a, the remaining part of the regex (b) matches only if the next character is b. Thus, this regex matches only if a character is both a and b at the same time.
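The explanation above can be tried out directly. As a contrast, this sketch also runs (?=a)a, a variant not in the original answer, to show that the lookahead itself consumes nothing.

```python
import re

contradiction = re.compile(r"(?=a)b")   # next character must be both a and b
consistent = re.compile(r"(?=a)a")      # lookahead and match agree on a

for sample in ["a", "b", "ab", "ba"]:
    # The contradictory pattern fails on every sample.
    assert contradiction.search(sample) is None

# The lookahead checks for "a" without consuming it, so the trailing "a"
# in the pattern still matches that same character.
match = consistent.search("ba")
print(match.group(), match.start())  # → a 1
```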
a\bc, where \b is a zero-width expression that matches a word boundary.
A word boundary can't appear in the middle of a word, which is exactly where this pattern forces it to be.
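A brief sketch of why a\bc fails, next to a variant with a space (my addition, not from the answer) where a word boundary really does exist after the a.

```python
import re

impossible = re.compile(r"a\bc")    # boundary demanded between two word characters
possible = re.compile(r"a\b c")     # boundary between "a" and a space is real

no_match_joined = impossible.search("ac")    # no boundary between a and c
no_match_spaced = impossible.search("a c")   # boundary exists, but a space follows, not c
spaced = possible.search("a c")

print(no_match_joined, no_match_spaced)  # → None None
print(repr(spaced.group()))              # → 'a c'
```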