In Regex, why is "((.|\s)*?)" different than "\s*.*"

Tags:

regex

I'm not a complete newbie, but I still don't understand everything about regular expressions. I was trying to use a regex to strip out tags, and my first attempt

<p\s*.*>

was so greedy it caught the whole line

<p someAttributes='example'>SomeText</p>

I got it to work with

((.|\s)*?)

This seems like it should be just as greedy. Can anyone help me understand why it isn't?

I'm trying to keep this question as language-agnostic as possible, but I was doing this with ColdFusion's reReplaceNoCase, in case that makes a difference.

asked Jun 06 '11 20:06

invertedSpear

1 Answer

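The key difference is the *? part, which creates a reluctant quantifier that tries to match as little as possible. The standard quantifier * is greedy and tries to match as much as possible. In your first attempt, the greedy .* first consumes the rest of the line and then backtracks only as far as the last >, which is why it caught the closing </p> as well.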

See e.g. "Greedy vs. Reluctant vs. Possessive Quantifiers".
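
To see the difference concretely, here is a minimal sketch in Python rather than ColdFusion (greedy and reluctant quantifiers behave the same way in Python's re module). The sample line is the one from the question. The variable names are illustrative, and wrapping the asker's fragment as <p((.|\s)*?)> is an assumption, since the question shows only the inner part.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Greedy: .* runs to the end of the line, then backtracks to the LAST '>'.
greedy = re.search(r"<p\s*.*>", line).group(0)

# Reluctant: .*? grows one character at a time and stops at the FIRST '>'.
reluctant = re.search(r"<p\s*.*?>", line).group(0)

# The asker's fragment, assumed here to sit between '<p' and '>'.
askers = re.search(r"<p((.|\s)*?)>", line).group(0)

print(greedy)     # <p someAttributes='example'>SomeText</p>
print(reluctant)  # <p someAttributes='example'>
print(askers)     # <p someAttributes='example'>
```

The only practical difference between .*? and (.|\s)*? is that the latter can also cross line breaks, because \s matches newline characters while . does not by default.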

As Seth Robertson noted, you might want to use a regex that does not depend on the greedy/reluctant behaviour. Indeed, you can write a possessive regex for best performance:

<p\s*+[^>]*+>

Here, \s*+ matches any number of whitespace characters, while [^>]*+ matches any number of characters except >. Neither quantifier backtracks once it has matched, which improves runtime when the overall match fails, and in some regex implementations also when it succeeds (because the internal backtracking data can be omitted).
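
As a rough illustration of the negated character class, the sketch below uses Python again. One caveat: Python's re module only accepts possessive quantifiers from version 3.11 onward, so the example uses the plain greedy form <p\s*[^>]*>, which accepts the same strings as the possessive pattern above and differs only in how much backtracking the engine may do. The name open_p is made up for this example.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Plain greedy form of the pattern above; the possessive version
# <p\s*+[^>]*+> needs an engine that supports possessive quantifiers.
open_p = re.compile(r"<p\s*[^>]*>")

# [^>]* cannot cross a '>', so the match always ends at the first one,
# regardless of greedy or reluctant behaviour.
matched = open_p.search(line).group(0)
print(matched)  # <p someAttributes='example'>

# Stripping the opening tag, roughly what the question set out to do.
stripped = open_p.sub("", line)
print(stripped)  # SomeText</p>
```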

Note that this pattern also matches any other tag whose name starts with <p, such as <pre> or <param>. If you don't want that, use a regex like this:

<p(\s++[^>]*+)?>

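This makes the whole section between <p and > optional, and when that section is present it must start with at least one whitespace character, so a tag name that merely begins with p no longer matches.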
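
The effect of the optional group can be checked with a small comparison. This is only a sketch: it again uses the plain greedy equivalents of the two patterns (for the same Python version reason as before), and the sample tags and the names loose and strict are assumptions chosen for illustration.

```python
import re

loose = re.compile(r"<p\s*[^>]*>")      # first pattern, greedy form
strict = re.compile(r"<p(\s+[^>]*)?>")  # second pattern, greedy form

samples = ["<p>", "<p class='x'>", "<pre>", "<param name='a'>"]

# fullmatch requires the whole sample to be matched by the pattern.
loose_hits = [bool(loose.fullmatch(tag)) for tag in samples]
strict_hits = [bool(strict.fullmatch(tag)) for tag in samples]

for tag, l, s in zip(samples, loose_hits, strict_hits):
    print(f"{tag!r}: loose={l} strict={s}")
```

With these samples, the loose pattern accepts all four tags, while the strict one accepts only the two real <p> tags.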

answered Oct 24 '22 23:10

Christian Semrau
