
In Regex, why is "((.|\s)*?)" different than "\s*.*"

Tags:

regex

I'm not a complete newbie, but I still don't understand everything about regular expressions. I was trying to use a regex to strip out <p> tags, and my first attempt

<p\s*.*>

was so greedy it caught the whole line

<p someAttributes='example'>SomeText</p>

I got it to work with

((.|\s)*?)

This seems like it should be just as greedy. Can anyone help me understand why it isn't?

I'm trying to keep this question as language-agnostic as possible, but I was doing this with ColdFusion's reReplaceNoCase, in case that makes a difference.

invertedSpear asked Jun 06 '11 20:06

1 Answer

The key difference is the *? part, which creates a reluctant quantifier, and so it tries to match as little as possible. The standard quantifier * is a greedy quantifier and tries to match as much as possible.

See e.g. Greedy vs. Reluctant vs. Possessive Quantifiers
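
To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; the question used ColdFusion). It runs both quantifier styles against the example line from the question. The reluctant pattern is wrapped in <p ... > here, which is an assumption about how it was actually used.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Greedy: .* first runs to the end of the line, then backs up only as far
# as the last '>', which here is the final character of the line.
greedy = re.search(r"<p\s*.*>", line).group(0)

# Reluctant: (.|\s)*? grows one character at a time and stops at the first '>'.
# The surrounding <p ... > is an assumption, not quoted from the question.
reluctant = re.search(r"<p((.|\s)*?)>", line).group(0)

print(greedy)     # <p someAttributes='example'>SomeText</p>
print(reluctant)  # <p someAttributes='example'>
```

Both patterns are allowed to match the same text; they differ only in which candidate the engine tries first.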

As Seth Robertson noted, you might want to use a regex that does not depend on the greedy/reluctant behaviour. Indeed, you can write a possessive regex for best performance:

<p\s*+[^>]*+>

Here, \s*+ matches any number of whitespace characters, while [^>]*+ matches any number of characters except >. Neither quantifier backtracks on a mismatch, which improves runtime when the match fails, and for some regex implementations also when it succeeds (because internal backtracking data can be omitted).
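
A rough Python sketch of the same idea follows. Python's re module only parses possessive quantifiers from version 3.11 on, so the sketch mainly uses the plain greedy form of the pattern, which finds the same matches because [^>] can never consume the closing >. The helper name first_match is made up for this example.

```python
import re
import sys

# Plain greedy form; [^>] can never run past the closing '>' of the tag.
OPEN_TAG = r"<p\s*[^>]*>"
# Possessive form from the answer; Python's re only parses *+ from 3.11 on.
POSSESSIVE_OPEN_TAG = r"<p\s*+[^>]*+>"

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

line = "<p someAttributes='example'>SomeText</p>"
plain_result = first_match(OPEN_TAG, line)
print(plain_result)  # <p someAttributes='example'>

# Only try the possessive pattern where the interpreter supports it.
if sys.version_info >= (3, 11):
    print(first_match(POSSESSIVE_OPEN_TAG, line) == plain_result)
```

The possessive version should return the same text; the gain is in how much work the engine does on inputs that fail to match.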

Note that if there are other tags starting with <p (I haven't written HTML directly for a long time), this regex matches those too. If you don't want that, use a regex like this:

<p(\s++[^>]*+)?>

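This makes the whole section between <p and > optional, and when that section is present it must start with whitespace, so a tag such as <pre> no longer matches.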
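
As a quick check of that claim, the sketch below compares the two patterns on a few tags. It is written in Python with the non-possessive equivalents (\s+ and * instead of \s++ and *+) so that it also runs on interpreters older than 3.11; the sample tags and the names LOOSE and STRICT are chosen for illustration.

```python
import re

LOOSE = re.compile(r"<p\s*[^>]*>")      # also accepts <pre>, <param ...>
STRICT = re.compile(r"<p(\s+[^>]*)?>")  # attributes must follow whitespace

samples = ["<p>", "<p class='x'>", "<pre>", "<param name='a'>"]

# Map each sample to (matches LOOSE, matches STRICT).
results = {
    s: (bool(LOOSE.fullmatch(s)), bool(STRICT.fullmatch(s)))
    for s in samples
}

for tag, (loose, strict) in results.items():
    print(f"{tag!r}: loose={loose}, strict={strict}")
```

Only the two genuine <p> tags should pass the stricter pattern, while the looser one accepts all four.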

Christian Semrau answered Oct 24 '22 23:10