Not a complete newbie, but I still don't understand everything about Regular expressions. I was trying to use Regex to strip out <p> tags and my first attempt
<p\s*.*>
was so greedy it caught the whole line
<p someAttributes='example'>SomeText</p>
I got it to work with
((.|\s)*?)
This seems like it should be just as greedy; can anyone help me understand why it isn't?
Trying to make this question as language non-specific as possible, but I was doing this with ColdFusion's reReplaceNoCase if it makes a lot of difference.
\s is fairly simple - it's a common shorthand in many regex flavours for "any whitespace character". This includes spaces, tabs, and newlines. *? is a little harder to explain. The * quantifier is fairly simple - it means "match this token (the character class in this case) zero or more times".
\\s*,\\s* means zero or more whitespace characters, followed by a comma, followed by zero or more whitespace characters. \s is one of the shorthand character classes. You can find similar shorthands on this site: http://www.regular-expressions.info/shorthand.html.
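To see that pattern in action, here is a minimal Python sketch that splits on it. The input string is made up for illustration, and the doubled backslashes above are presumably string-literal escaping, which a Python raw string does not need.

```python
import re

# Hypothetical input with uneven spacing around the commas.
raw = "apple ,  banana,cherry"

# \s*,\s* = optional whitespace, a comma, optional whitespace.
# Splitting on it swallows the padding along with each comma.
fields = re.split(r"\s*,\s*", raw)
print(fields)
```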
Basically (0+1)* matches any sequence of ones and zeroes. So, in your example, (0+1)*1(0+1)* should match any sequence that contains at least one 1. It would not match 000, but it would match 010, 1, 111 etc. (0+1) means 0 OR 1.
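As a rough check, the same expression can be tried in Python, assuming the formal-notation + (OR) is rewritten as |, which is how most regex engines spell alternation. The helper name accepts is just for this sketch.

```python
import re

# (0+1) in formal notation becomes (0|1) in engine syntax.
contains_one = re.compile(r"(0|1)*1(0|1)*")

def accepts(s: str) -> bool:
    # fullmatch requires the whole string to fit the expression.
    return contains_one.fullmatch(s) is not None

results = {s: accepts(s) for s in ["000", "010", "1", "111"]}
print(results)
```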
* means zero-or-more, and + means one-or-more. So the difference is that the empty string would match the second expression but not the first.
The key difference is the *? part, which creates a reluctant quantifier, so it tries to match as little as possible. The standard quantifier * is a greedy quantifier and tries to match as much as possible.
See e.g. Greedy vs. Reluctant vs. Possessive Quantifiers
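The difference is easy to reproduce on the line from the question. This is a small sketch in Python rather than ColdFusion; re.IGNORECASE stands in for the case-insensitivity of reReplaceNoCase, and the reluctant pattern is simply the first attempt with *? swapped in, not the exact expression from the question.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Greedy: .* runs to the end of the line, then backs up to the LAST '>'.
greedy = re.compile(r"<p\s*.*>", re.IGNORECASE)
# Reluctant: .*? grows one character at a time and stops at the FIRST '>'.
reluctant = re.compile(r"<p\s*.*?>", re.IGNORECASE)

greedy_match = greedy.search(line).group(0)
reluctant_match = reluctant.search(line).group(0)
stripped = reluctant.sub("", line)

print(greedy_match)
print(reluctant_match)
print(stripped)
```

Note that . does not match a newline by default in most flavours, which is presumably why the working version used (.|\s) instead.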
As Seth Robertson noted, you might want to use a regex that does not depend on the greedy/reluctant behaviour. Indeed, you can write a possessive regex for best performance:
<p\s*+[^>]*+>
Here, \s*+ matches any amount of whitespace, while [^>]*+ matches any number of characters other than >. Neither quantifier backtracks after a mismatch, which improves runtime when the match fails and, in some regex implementations, also when it succeeds (because internal backtracking data can be omitted).
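A hedged sketch of this pattern in Python follows. Possessive quantifiers are not available in every flavour; Python's re module only accepts them from version 3.11 on, so the sketch falls back to the plain [^>]* form on older interpreters, which matches the same text here and only loses the no-backtracking guarantee. The multi-line sample is made up for illustration.

```python
import re
import sys

line = "<p someAttributes='example'>SomeText</p>"

# Possessive syntax needs Python 3.11+; the fallback is the ordinary greedy form.
if sys.version_info >= (3, 11):
    pattern = r"<p\s*+[^>]*+>"
else:
    pattern = r"<p\s*[^>]*>"

open_p = re.compile(pattern, re.IGNORECASE)

matched = open_p.search(line).group(0)
remainder = open_p.sub("", line)

# [^>] also matches newlines, so attributes spread over lines are covered.
multi_line = open_p.sub("", "<p\n class='x'>Hi</p>")

print(matched)
print(remainder)
```

Because [^>] can never consume a >, the match stops at the first > regardless of greediness.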
Note that if there are other tags starting with <p, such as <pre> (I haven't written HTML directly in a long time), this matches them too. If you don't want that, use a regex like this:
<p(\s++[^>]*+)?>
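This makes the whole section between <p and > optional, and requires it to begin with whitespace when it is present, so a tag name that merely starts with p no longer matches.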
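To illustrate the difference between the two patterns, here is a short Python sketch run against a few made-up tags. As an assumption for portability, it uses the possessive forms only on Python 3.11+ and the equivalent non-possessive forms otherwise; the names loose and strict are just for this example.

```python
import re
import sys

possessive_ok = sys.version_info >= (3, 11)

# First pattern: anything may follow "<p", including more tag-name letters.
loose = re.compile(r"<p\s*+[^>]*+>" if possessive_ok else r"<p\s*[^>]*>")
# Second pattern: "<p" must be followed by whitespace or directly by ">".
strict = re.compile(r"<p(\s++[^>]*+)?>" if possessive_ok else r"<p(\s+[^>]*)?>")

samples = ["<p>", "<p class='a'>", "<pre>", "<param name='x'>"]

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)
print(strict_hits)
```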