meaning of (\/?) in regex / is (\w+)([^>]*?) a redundancy?

This regular expression should match an HTML start tag, I think.

var results = html.match(/<(\/?)(\w+)([^>]*?)>/);

I see that it first matches the <, but then I am confused about what the capture (\/?) accomplishes. Am I correct in reasoning that ([^>]*?)> matches any character except >, zero or more times, followed by a >? If so, why is the (\w+) capture necessary? Doesn't it already fall within the purview of the [^>]*? group?

What does \W* mean in regex?

In regex, an uppercase shorthand class is the inverse of its lowercase counterpart: \w matches a word character and \W a non-word character; \d matches a digit and \D a non-digit. So \W* matches zero or more non-word characters.
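A small sketch of the lowercase/uppercase pairing, using a made-up sample string (the variable names are only for illustration):

```javascript
// Every character is matched by exactly one class of each pair.
const sample = "a_1 -!";

const wordChars = sample.match(/\w/g);    // letters, digits, underscore
const nonWordChars = sample.match(/\W/g); // everything \w rejects
const digits = sample.match(/\d/g);       // 0-9
const nonDigits = sample.match(/\D/g);    // everything \d rejects

console.log(wordChars);    // [ 'a', '_', '1' ]
console.log(nonWordChars); // [ ' ', '-', '!' ]
console.log(digits);       // [ '1' ]
console.log(nonDigits);    // [ 'a', '_', ' ', '-', '!' ]
```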

What is \W in Python regex?

\w (lowercase w) matches a "word" character: a letter, digit, or underscore, [a-zA-Z0-9_]. (In Python 3, \w in a str pattern also matches Unicode letters and digits unless the re.ASCII flag is set.) Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. \W (uppercase W) matches any non-word character.

What does * do in regex?

The match-zero-or-more operator (*): this operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. For example, "o*" matches any string made up of zero or more "o"s.
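As a rough illustration in JavaScript (the test strings are invented for this sketch), the anchored pattern below accepts only strings made entirely of "o", including the empty string. The second pattern shows that * applies only to the token right before it:

```javascript
// ^ and $ force the whole string to consist of zero or more "o"s.
const onlyOs = /^o*$/;
const results = ["", "o", "ooo", "oxo"].map((s) => onlyOs.test(s));
console.log(results); // [ true, true, true, false ]

// In /fo*/ the star repeats "o" only, not "fo".
const repeats = "fooo f".match(/fo*/g);
console.log(repeats); // [ 'fooo', 'f' ]
```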

What is the use of class \w in regex?

The \w metacharacter matches word characters. A word character is one of a-z, A-Z, 0-9, or _ (underscore).

Take it token by token:

/ begin regex literal
< match a literal <
(\/?) match 0 or 1 (?) literal /, which is escaped by the \
(\w+) match one or more "word characters"
([^>]*?) lazily* match zero or more (*?) of anything that is not a >
> match a literal >
/ end regex literal

lazily* - adding "?" after a repetition quantifier makes it lazy: the regex matches the preceding token as few times as possible while still allowing the overall match to succeed. See the documentation.
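To make the lazy/greedy difference concrete, here is a minimal sketch with a made-up input. It also checks an observation about the pattern in the question: because [^>] can never consume a ">", the lazy and greedy forms of the third group should give the same result there (the tagGreedy variant is my own, not from the question).

```javascript
const input = "<b>bold</b>";

// Greedy: .* grabs as much as possible, so the match runs to the last ">".
const greedy = input.match(/<.*>/)[0];
// Lazy: .*? stops at the first ">" that lets the match succeed.
const lazy = input.match(/<.*?>/)[0];

console.log(greedy); // <b>bold</b>
console.log(lazy);   // <b>

// With a negated class the quantifier cannot run past ">", so laziness
// does not change what is matched or captured.
const tagLazy = /<(\/?)(\w+)([^>]*?)>/;
const tagGreedy = /<(\/?)(\w+)([^>]*)>/; // hypothetical greedy variant
const html = '<a href="#">link</a>';
const sameResult =
  JSON.stringify(html.match(tagLazy)) === JSON.stringify(html.match(tagGreedy));
console.log(sameResult); // true
```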

So essentially this regular expression will match "<", optionally followed by a "/", followed by one or more letters, digits, or underscores, followed by zero or more characters that are not ">", and finally followed by a ">".

That being said, the token (\w+) is not redundant: it ensures there is at least one word character directly after the < (or </), and it captures the tag name in its own group.
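One way to see why (\w+) matters is to compare the original pattern with a variant that drops it. The withoutName pattern and the sample strings are assumptions made up for this sketch:

```javascript
const withName = /<(\/?)(\w+)([^>]*?)>/;
const withoutName = /<(\/?)([^>]*?)>/; // hypothetical variant without (\w+)

const samples = ["<>", "if (a < b && c > d)", "<div class='x'>"];
const results = samples.map((s) => ({
  input: s,
  withName: withName.test(s),
  withoutName: withoutName.test(s),
}));
console.log(results);
// "<>" and the comparison expression are accepted only by withoutName;
// "<div class='x'>" is accepted by both.

// The second group holds just the tag name.
const name = "<div class='x'>".match(withName)[2];
console.log(name); // div
```

Without (\w+), anything between a "<" and the next ">" would count as a tag, including ordinary comparison operators in text.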

Please be aware that attempting to parse HTML with regular expressions is generally a bad idea.

Using the power of Debuggex to generate an image for you :)

<(\/?)(\w+)([^>]*?)>

It will be evaluated like this:

[Image: Debuggex diagram of the regular expression, https://www.debuggex.com/i/hnhZ3pDQrgvXlpHg.png]

As you can see, it matches HTML tags (both opening and closing tags). The regex contains three capture groups, capturing the following:

(\/?) existence of / (it's a closing tag, if present)
(\w+) name of the tag
([^>]*?) everything else until the tag closes (e.g. attributes)
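The three groups can be inspected directly in JavaScript. This is only a sketch; describeTag and the property names are hypothetical helpers, not part of the question's code:

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: label the three capture groups of a match.
function describeTag(html) {
  const m = html.match(tagPattern);
  if (m === null) return null;
  const [whole, slash, name, rest] = m;
  return { whole, isClosing: slash === "/", name, rest };
}

console.log(describeTag('<a href="#">'));
// { whole: '<a href="#">', isClosing: false, name: 'a', rest: ' href="#"' }
console.log(describeTag("</div>"));
// { whole: '</div>', isClosing: true, name: 'div', rest: '' }
```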

This way it matches <a href="#">. Interestingly, it does not match <a data-fun="fun>nofun"> correctly, because it stops at the > within the data-fun attribute, even though > is allowed inside a quoted attribute value.
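The early stop is easy to reproduce. A minimal check using the same example tag:

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;
const tag = '<a data-fun="fun>nofun">';
const m = tag.match(tagPattern);

// The match ends at the ">" inside the attribute value, not at the end of the tag.
console.log(m[0]); // <a data-fun="fun>
console.log(m[3]); //  data-fun="fun
console.log(m[0] === tag); // false
```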

Another funny thing is that the tag-name capture does not capture all theoretically valid XHTML tag names. XHTML allows Letter | Digit | '.' | '-' | '_' | ':' | .. in a name (source: XHTML spec). (\w+), however, does not match ., -, or :. An imaginary <.foobar> tag will not be matched by this regex at all, and for a name like <foo-bar> only foo ends up in the name group (the remainder lands in the third group). This should not have any real-life impact, though.
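A quick check of how the pattern treats names containing characters outside \w. The nameAndRest helper is hypothetical, and the hyphenated and prefixed tag names are invented examples:

```javascript
const tagPattern = /<(\/?)(\w+)([^>]*?)>/;

// Hypothetical helper: return the name group and the trailing group, or null.
function nameAndRest(tag) {
  const m = tag.match(tagPattern);
  return m === null ? null : { name: m[2], rest: m[3] };
}

console.log(nameAndRest("<.foobar>"));  // null
console.log(nameAndRest("<foo-bar>"));  // { name: 'foo', rest: '-bar' }
console.log(nameAndRest("<svg:rect>")); // { name: 'svg', rest: ':rect' }
```

So a name that starts with a non-word character is rejected outright, while one that merely contains such a character is matched but split at that character.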

As you can see, parsing HTML with regexes is a risky thing. You might be better off with an HTML parser.

Tags: javascript, regex

2 Answers: jbabey, tessi
