I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:
(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+
which works for strings like
<a href="test.html" class="xyz">
and (single quotes)
<a href='test.html' class="xyz">
but not for a string without quotes:
<a href=test.html class=xyz>
How can I modify my regex so that it also works with unquoted attribute values? Or is there a better way to do this?
Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.
The [] construct in a regex is essentially shorthand for an | across all of its contents. For example, [abc] matches a, b, or c. Additionally, the - character has a special meaning inside []: it defines a range, so the regex [a-z] matches any letter from a through z.
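To make the character-class behaviour above concrete, here is a minimal sketch. Python and the sample strings are my own choice for illustration; they are not part of the original answer.

```python
import re

# [abc] matches exactly one character that is a, b, or c,
# which is equivalent to the alternation a|b|c.
class_hits = re.findall(r"[abc]", "cabbage")
alt_hits = re.findall(r"a|b|c", "cabbage")

# Inside [], the - forms a range: [a-z] is any single lowercase ASCII letter.
range_hits = re.findall(r"[a-z]", "R2-D2 and c3po")

print(class_hits)   # ['c', 'a', 'b', 'b', 'a']
print(alt_hits)     # same list as class_hits
print(range_hits)   # ['a', 'n', 'd', 'c', 'p', 'o']
```

The question's own regex relies on the same construct: each of its value alternatives is built around a bracketed class.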
Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note that regex101.com did not exist when I originally wrote this answer):
<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?
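A quick way to try the pattern above outside regex101 is a small Python sketch. The pattern is copied from the line above; the helper name href_of and the sample tags are assumptions of mine, and Python's re engine may differ in details from the flavor used on regex101.

```python
import re

# Pattern copied from the answer; group 1 is the optional opening quote,
# group 2 is the href value.
HREF_RE = re.compile(r"""<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?""")

def href_of(tag):
    """Return the href value of an <a> tag, or None if no href is found.

    Hypothetical helper, only here to exercise the pattern.
    """
    match = HREF_RE.search(tag)
    return match.group(2) if match else None

print(href_of('<a href="test.html" class="xyz">'))
print(href_of("<a href='test.html' class=\"xyz\">"))
print(href_of('<a class="xyz" href="test.html">'))
```

In this sketch I only feed it quoted values. As far as I can tell, with an unquoted value the lookahead has no closing quote to stop at, so the capture runs up to the closing >.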
Update 2021 bis: Dave proposes in the comments to take into account an attribute value containing an equals sign, like <img src="test.png?test=val" />, as in this regex101:
(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
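The reason the attribute name is matched with \w+ here can be seen by running it side by side with a (\S+) name group on the example tag. This is only a sketch in Python (my choice of language); the constant names ATTR_W and ATTR_S are made up for the comparison.

```python
import re

# Dave's variant: the attribute name is matched with \w+.
ATTR_W = re.compile(r"""(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?""")

# Same pattern, but with a greedy \S+ for the attribute name.
ATTR_S = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?""")

tag = '<img src="test.png?test=val" />'

with_word_name = ATTR_W.findall(tag)
with_nonspace_name = ATTR_S.findall(tag)

print(with_word_name)      # [('src', 'test.png?test=val')]
print(with_nonspace_name)  # the name group swallows everything up to the last '='
```

Because \S+ is greedy, it backtracks from the end of the token and settles on the last = it finds, so the name ends up containing part of the value. \w+ cannot cross the first = or the quote, which keeps the name to src.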
Update (2020): Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note that regex101.com did not exist when I originally wrote this answer):
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Applied to:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
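For anyone who wants to check a few of these inputs locally, here is a hedged Python sketch. The pattern is the 2020 one from above; the choice of three sample tags and the variable names are mine, and I have not re-run the whole list.

```python
import re

# Pattern copied from the 2020 update; everything else is illustrative.
ATTR_RE = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?""")

samples = [
    '<img src="a test.png">',
    '<img src=a test.png />',
    '<img src=test.png alt=crap >',
]

# Map each sample tag to the (name, value) pairs the regex finds in it.
results = {tag: ATTR_RE.findall(tag) for tag in samples}

for tag, pairs in results.items():
    print(tag, '->', pairs)
```

The added \s*\/? in the lookahead is what lets an unquoted value stop before a trailing " />" or " >" instead of absorbing it.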
Original answer (2008): If you have an element like
<name attribute=value attribute="value" attribute='value'>
this regex could be used to find each attribute name and value in turn:
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Applied on:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
it would yield:
'href' => 'test.html'
'class' => 'xyz'
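As a rough way to reproduce that result, the 2008 pattern can be dropped into Python as is. This is a sketch, not the original code: the answer does not say which language it was written for, and the function name attributes is hypothetical.

```python
import re

# 2008 pattern from the answer: group 1 is the name, group 2 the value.
ATTR_2008 = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?""")

def attributes(tag):
    """Collect name -> value pairs from a single tag (illustrative helper)."""
    return dict(ATTR_2008.findall(tag))

for tag in (
    '<a href=test.html class=xyz>',
    '<a href="test.html" class="xyz">',
    "<a href='test.html' class=\"xyz\">",
):
    print(attributes(tag))   # {'href': 'test.html', 'class': 'xyz'}
```

The negative lookahead is doing the work here: a value character is accepted only if it is not followed by the start of the next name= pair, a quote, or >.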
Note: This does not work with single-character attribute values such as a one-digit number (the value part of the regex needs at least two characters), e.g.
<div id="1">
won't work.
Edited: Improved regex for getting attributes with no value and values with " ' " inside.
([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?
Applied on:
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
it would yield:
'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
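Here is a hedged Python sketch of how that listing can be reproduced. One assumption of mine: the regex is applied only to the attribute part of the opening tag, because run on the whole string it also appears to pick up the tag name as a valueless "attribute". The small extraction step is not part of the answer.

```python
import re

# Improved pattern from the answer: group 1 = name, group 2 = quote, group 3 = value.
ATTR_RE = re.compile(r"""([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?""")

tag = '''<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>'''

# Assumption: isolate the text between the tag name and the first '>'.
inner = re.match(r"<\w+\s+([^>]*)>", tag).group(1)

# Attributes without a value (defer, async) come back as empty strings.
attrs = {name: value for name, _quote, value in ATTR_RE.findall(inner)}

for name, value in attrs.items():
    print(repr(name), '=>', repr(value))
```

Backreferencing the opening quote with \2 is what allows the other kind of quote to appear inside the value, as in the onload handler.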