Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

Tags: html, regex, xml

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
 attr="5"
/>
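To make the line-break point concrete, here is a minimal Python sketch. It assumes the tag is split across three lines; the variable names and the pattern are mine, purely for illustration. The pattern is applied one line at a time, the way a line-oriented script would do it, and the result is compared with the standard library's XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input: one empty element spread over three lines.
doc = '<tag\n  attr="5"\n/>'

# Line-oriented approach: try the pattern against each line separately.
line_pattern = re.compile(r'<tag\s+attr="(\d+)"\s*/>')
line_hits = [
    m.group(1)
    for line in doc.splitlines()
    for m in line_pattern.finditer(line)
]

# A real XML parser does not care where the line breaks fall.
parsed_attr = ET.fromstring(doc).get("attr")

print(line_hits)    # no single line contains the whole tag
print(parsed_attr)
```

Running the pattern over the whole string instead of per line would rescue this particular case, but only until the next formatting variation shows up.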

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" /> 
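A rough sketch of how this bites in practice, using Python's standard library. `TagCollector` is a hypothetical helper I made up for the comparison, and the "naive" pattern is just one common way people write it.

```python
import re
from html.parser import HTMLParser

doc = '<img src="imgtag.gif" alt="<img>" />'

# Naive pattern: a tag is "<", anything that is not ">", then ">".
naive_tags = re.findall(r"<[^>]*>", doc)

class TagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

collector = TagCollector()
collector.feed(doc)
collector.close()

print(naive_tags)      # the match stops at the ">" inside the alt value
print(collector.tags)  # the parser sees one img tag with two attributes
```

The regex cuts the tag off in the middle of the alt attribute, while the parser knows that a ">" inside a quoted attribute value does not end the tag.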

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span>  
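Here is a small Python illustration of the nesting problem. The lazy pattern below is only an assumption about what a typical attempt looks like; it pairs the outer opening tag with the inner closing tag. A parser builds the tree, so the nesting comes out right.

```python
import re
import xml.etree.ElementTree as ET

doc = '<span id="outer"><span id="inner">foo</span></span>'

# Typical attempt: lazily match from an opening span to the next closing span.
lazy_match = re.search(r"<span[^>]*>(.*?)</span>", doc)
captured = lazy_match.group(1)

# The parser keeps track of nesting for us.
outer = ET.fromstring(doc)
inner = outer[0]

print(captured)                       # content wrongly attributed to the outer span
print(outer.get("id"), inner.get("id"), inner.text)
```

Making the match greedy fixes this input and breaks on two sibling spans instead, which is the usual sign that the tool does not fit the problem.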

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>) <span class="prefix">348</span>-<span class="linenum">3020</span></span> 
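One way to deal with this, sketched in Python: let a parser strip the markup first, then run the regex over the remaining text. The phone number pattern is a simplified assumption for US-style numbers, not a general solution.

```python
import re
import xml.etree.ElementTree as ET

doc = (
    '<span class="phonenum">(<span class="area code">703</span>) '
    '<span class="prefix">348</span>-<span class="linenum">3020</span></span>'
)

# Simplified pattern for numbers written like "(703) 348-3020".
phone_pattern = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

# Against the raw markup the digits are separated by tags, so nothing matches.
raw_hits = phone_pattern.findall(doc)

# Extract the text nodes in document order, then search the plain text.
visible_text = "".join(ET.fromstring(doc).itertext())
text_hits = phone_pattern.findall(visible_text)

print(raw_hits)
print(text_hits)
```

The regex is still useful here, but only after the parser has done the job of separating content from markup.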

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>
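A quick Python sketch of what goes wrong with comments. `LinkCollector` is a hypothetical helper and the href pattern is just a plausible naive attempt. The regex treats the broken tag inside the comment as real markup and loses the second link, while the parser skips the comment.

```python
import re
from html.parser import HTMLParser

doc = '<a href="foo">foo</a> <!-- FIXME: <a href=" --> <a href="bar">bar</a>'

# Naive attempt: grab whatever sits between the quotes after href=.
naive_hrefs = re.findall(r'<a href="([^"]*)"', doc)

class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from real <a> tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

collector = LinkCollector()
collector.feed(doc)
collector.close()

print(naive_hrefs)      # second "link" is comment debris, and bar is missing
print(collector.hrefs)
```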

What other gotchas are you aware of?

Chas. Owens asked Mar 31 '09 14:03



1 Answer

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
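For what it's worth, a conforming parser takes that document in stride. Below is a minimal Python sketch using the standard library's `xml.etree.ElementTree`; the line breaks in the string are my own layout choice. It relies on the underlying expat parser expanding the internal entity `y`, which as far as I know is its default behaviour.

```python
import xml.etree.ElementTree as ET

doc = '''<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>'''

root = ET.fromstring(doc)

# The entity reference in the attribute is replaced by its declared value.
attr_value = root.find("a").get("b")

# CDATA content and the text after the processing instruction are plain text.
text = "".join(root.itertext())

print(root.tag)
print(attr_value)
print(text)
```

Every "<" and ">" that looks like markup inside the entity value, the CDATA section, and the processing instruction ends up as ordinary data, and only `x` and `a` are elements.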

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">

<HTML/
  <HEAD/
    <TITLE/>/
    <P/>
bobince answered Oct 09 '22 18:10