
Why should we prefer negated character classes to .* in regexes?

I was looking at a tutorial on Regex.

It was about how to get the class attribute from this piece of HTML:

<pre class="ruby" name="code">

and the regex used was

<pre class="([^"]+)" name="code">

They recommended using the one above instead of

<pre class="(.+)" name="code">

"as it goes beyond the quote."

I don't understand what they mean. Both seem to work on this input, so why is the first regex recommended? Am I missing anything? Please enlighten me.

Thanks in advance.

Vigneshwaran asked Jan 17 '23 23:01


1 Answer

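.+ matches greedily: it consumes as many characters as it can, quotes included, and only gives characters back when the rest of the pattern would otherwise fail. For example, in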

<pre class="ruby" size="medium" name="code"> 

it would match ruby" size="medium. Even worse, if you had two tags on the same line, it would match right across the tag boundaries:

<pre class="ruby" name="code">foo</pre> <pre class="python" name="code">bar</pre>

the group would capture ruby" name="code">foo</pre> <pre class="python (that is, everything up to the last " name="code"> on the line)!

So as long as you know exactly what your HTML will look like, .+ can work. But as soon as the HTML changes unexpectedly (as HTML is wont to do), the .+ regex won't simply fail to match, as the [^"]+ one would; it will silently match the wrong stuff.

Therefore, the [^"]+ regex is safer, since it is more explicit about what exactly is allowed to match. You should usually try to avoid a bare .+ or .* ("match anything") and instead think about what you actually want to match.
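To make the difference concrete, here is a small sketch in Python (my choice of language; the tutorial's is not stated) that runs both patterns from the question over the sample strings used above. The helper name first_group is made up for illustration.

```python
import re

# The two patterns from the question.
greedy = re.compile(r'<pre class="(.+)" name="code">')
negated = re.compile(r'<pre class="([^"]+)" name="code">')

# Sample inputs taken from the examples in this answer.
one_tag = '<pre class="ruby" name="code">'
extra_attr = '<pre class="ruby" size="medium" name="code">'
two_tags = ('<pre class="ruby" name="code">foo</pre> '
            '<pre class="python" name="code">bar</pre>')


def first_group(pattern, text):
    """Return the first captured group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


for text in (one_tag, extra_attr, two_tags):
    print("input:  ", text)
    print("  .+    ->", repr(first_group(greedy, text)))
    print('  [^"]+ ->', repr(first_group(negated, text)))
```

On the one-attribute tag both patterns agree. On the tag with the extra size attribute, .+ captures past the quote while [^"]+ refuses to match at all, which is the "fail instead of matching the wrong stuff" behaviour described above.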

That said, for precisely the same reasons, you shouldn't try and match HTML and other markup languages with regexes anyway because there are better tools for that.
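As one illustration of such a tool (an assumption on my part; no specific parser is named here), the sketch below uses Python's standard-library html.parser module to collect the class of every pre tag whose name attribute is "code". The names PreClassCollector and pre_classes are hypothetical.

```python
from html.parser import HTMLParser


class PreClassCollector(HTMLParser):
    """Collects the class attribute of each <pre name="code"> start tag."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; attribute order does not matter.
        if tag == "pre":
            attr_map = dict(attrs)
            if attr_map.get("name") == "code" and "class" in attr_map:
                self.classes.append(attr_map["class"])


def pre_classes(html):
    """Return the class values of all matching <pre> tags in the given HTML."""
    collector = PreClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


sample = ('<pre class="ruby" size="medium" name="code">foo</pre> '
          '<pre class="python" name="code">bar</pre>')
print(pre_classes(sample))
```

Because the parser works on attributes rather than on raw text, an extra attribute or a different attribute order does not change the result.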

Tim Pietzcker answered Mar 02 '23 22:03