One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:
People want to treat a file as a sequence of lines, but this is valid:
<tag attr="5" />
People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:
<img src="imgtag.gif" alt="<img>" />
People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):
<span id="outer"><span id="inner">foo</span></span>
People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):
<span class="phonenum">(<span class="area code">703</span>) <span class="prefix">348</span>-<span class="linenum">3020</span></span>
Comments may contain poorly formatted or incomplete tags:
<a href="foo">foo</a> <!-- FIXME: <a href=" --> <a href="bar">bar</a>
What other gotchas are you aware of?
HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
XML is not a regular language (that's a technical term) so you will never be able to parse it correctly using a regular expression.
You can try parsing an HTML file using a XML parser, but it's likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don't understand. XML parsers will fail to parse any HTML document that uses any of those features.
XML parsing is the process of reading an XML document and providing an interface to the user application for accessing the document. An XML parser is a software apparatus that accomplishes such tasks.
Here's some fun valid XML for you:
<!DOCTYPE x [ <!ENTITY y "a]>b"> ]> <x> <a b="&y;>" /> <![CDATA[[a>b <a>b <a]]> <?x <a> <!-- <b> ?> c --> d </x>
And this little bundle of joy is valid HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [ <!ENTITY % e "href='hello'"> <!ENTITY e "<a %e;>"> ]> <title>x</TITLE> </head> <p id = a:b center> <span / hello </span> &<br left> <!---- >t<!---> < --> &e link </a> </body>
Not to mention all the browser-specific parsing for invalid constructs.
Good luck pitting regex against that!
EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <HTML/ <HEAD/ <TITLE/>/ <P/>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With