
Regular expression for extracting tag attributes

Tags: html, regex

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:

(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+ 

which works for strings like

<a href="test.html" class="xyz"> 

and (single quotes)

<a href='test.html' class="xyz"> 

but not for a string without quotes:

<a href=test.html class=xyz> 

How can I modify my regex so that it also works with unquoted attribute values? Or is there a better way to do this?

Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.

asked Nov 25 '08 11:11 by splattne



1 Answer

Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note that regex101.com did not exist when I originally wrote this answer)

<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1? 
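As a rough illustration, here is how that pattern could be run with Python's re module (Python is my choice for the sketch, not something the comment specifies; the helper name extract_href is made up). Only the quoted sample tags are exercised, because handling of the optional backreference \1 can differ between regex flavors.

```python
import re

# Pattern copied from the 2021 update; group 1 is the optional opening
# quote and group 2 is the href value.
HREF_RE = re.compile(r'''<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?''')


def extract_href(tag):
    """Return the href value of an <a> tag, or None if nothing matches.

    extract_href is a hypothetical helper used only for this sketch.
    """
    match = HREF_RE.search(tag)
    return match.group(2) if match else None


print(extract_href('<a href="test.html" class="xyz">'))
print(extract_href("<a href='test.html' class=\"xyz\">"))
```

Both calls should print test.html. The pattern only targets href, so other attributes are ignored.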

Update 2021 bis: Dave proposes in the comments a variant that takes into account an attribute value containing an equals sign, like <img src="test.png?test=val" />, as in this regex101:

(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']? 
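To see why the name group matters, the following sketch (Python's re module is assumed, and the constant names are invented for the example) runs Dave's \w+ variant and the \S+ variant from this answer on the same tag. With \S+, the greedy name can run up to the last equals sign inside the value.

```python
import re

# Both patterns are copied from this answer; only the name group differs.
WORD_NAME_RE = re.compile(
    r'''(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?''')
NONSPACE_NAME_RE = re.compile(
    r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?''')

tag = '<img src="test.png?test=val" />'

# Each result is a list of (name, value) tuples.
word_result = WORD_NAME_RE.findall(tag)
nonspace_result = NONSPACE_NAME_RE.findall(tag)

print(word_result)
print(nonspace_result)
```

In this sketch the \w+ variant reports src with the full value test.png?test=val, while the \S+ variant splits at the second equals sign and reports val as the value.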

Update (2020): Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note that regex101.com did not exist when I originally wrote this answer)

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']? 

Applied to:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
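A minimal sketch of trying the 2020 pattern on a few of these inputs, assuming Python's re module (the function name extract_attributes is hypothetical). It covers only four of the sample tags, so treat it as a spot check, not a full verification.

```python
import re

# Pattern copied from the 2020 update (Gyum Fox).
ATTR_RE = re.compile(
    r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?''')

samples = [
    '<img src="test.png">',
    '<img src="a test.png">',
    '<img src=test.png />',
    '<img src=test.png alt=crap >',
]


def extract_attributes(tag):
    """Return a list of (name, value) tuples found in one tag."""
    return ATTR_RE.findall(tag)


for tag in samples:
    print(tag, '->', extract_attributes(tag))
```

The extra `\s*\/?` in the lookahead is what keeps the trailing " /" of a self-closing tag out of an unquoted value.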

Original answer (2008): If you have an element like

<name attribute=value attribute="value" attribute='value'> 

this regex can be used to find each attribute name and value in turn:

(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? 

Applied on:

<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">

it would yield:

'href' => 'test.html'
'class' => 'xyz'

Note: This does not work with single-character attribute values, because the value part of the pattern needs at least two characters; e.g. <div id="1"> is not captured correctly.
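For illustration, the original pattern could be applied like this in Python (the re module and the helper parse_attributes are assumptions of the sketch; the original answer does not name a language). Each tag is parsed on its own and the matches are collected into a dict.

```python
import re

# Pattern copied from the original (2008) answer.
ATTR_RE = re.compile(
    r'''(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?''')

tags = [
    '<a href=test.html class=xyz>',
    '<a href="test.html" class="xyz">',
    "<a href='test.html' class=\"xyz\">",
]


def parse_attributes(tag):
    """Map attribute names to values for a single tag (hypothetical helper)."""
    return dict(ATTR_RE.findall(tag))


for tag in tags:
    print(parse_attributes(tag))

# A one-character value is the problem case from the note: the value group
# needs at least two characters, so the result here is not a clean '1'.
print(parse_attributes('<div id="1">'))
```

All three anchor tags should give href as test.html and class as xyz, whichever quoting style is used.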

Edit: improved regex that also handles attributes with no value and values containing a quote character (such as ' inside a double-quoted value).

([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)? 

Applied on:

<script type="text/javascript" defer async id="something" onload="alert('hello');"></script> 

it would yield:

'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
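A small sketch of the improved pattern in Python, under two assumptions that are mine and not the answer's: the attribute portion of the tag has already been isolated, and value-less attributes are mapped to an empty string. The helper name parse_attributes is hypothetical.

```python
import re

# Pattern copied from the edited answer: group 1 is the name, group 2 the
# optional opening quote, group 3 the value (absent for bare attributes).
ATTR_RE = re.compile(
    r'''([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?''')

# Assumption: only the attribute text is passed in, not the whole tag.
attribute_text = (
    'type="text/javascript" defer async id="something" '
    'onload="alert(\'hello\');"'
)


def parse_attributes(text):
    """Map each attribute name to its value, using '' when there is none."""
    return {m.group(1): m.group(3) or '' for m in ATTR_RE.finditer(text)}


print(parse_attributes(attribute_text))
```

Run against the whole tag instead, the name group would, as far as I can tell, also pick up pieces such as <script as value-less names, which is why the sketch strips the tag delimiters first.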
answered Sep 22 '22 23:09 by VonC