Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]

Q: Why can't you parse HTML with regex?

HTML is not a regular language, so regular expressions cannot parse it in general. A regex can match flat patterns, but it cannot track the arbitrarily nested structure that gives an HTML document its meaning.

Q: Can XML be parsed with regex?

XML is not a regular language (that's a technical term) so you will never be able to parse it correctly using a regular expression.

Q: Can I use an XML parser to parse HTML?

You can try parsing an HTML file with an XML parser, but it is likely to fail. HTML permits constructs that XML parsers do not understand, such as void elements written without a closing slash (<br>), omitted end tags, unquoted attribute values, and attributes with no value. An XML parser will reject any HTML document that uses one of them.
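As a rough illustration of that failure, the sketch below feeds one small HTML fragment to Python's strict XML parser and to its lenient HTML parser. The fragment and the helper names (parses_as_xml, html_start_tags) are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment: fine as HTML, but not well-formed XML because
# <br> is never closed and the attribute value is unquoted.
HTML_SNIPPET = "<p class=intro>first line<br>second line</p>"


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    """Return the start tags seen by the HTML parser, in document order."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parses_as_xml(HTML_SNIPPET))
    print(html_start_tags(HTML_SNIPPET))
```

The XML parser gives up at the unquoted attribute, while the HTML parser reports both the p and br tags.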

Q: What is parsing in XML with example?

XML parsing is the process of reading an XML document and exposing its contents to the application through an interface such as a tree of elements or a stream of events. An XML parser is the software component that does this.
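A minimal sketch of what that looks like in practice, using Python's standard xml.etree.ElementTree module. The document and the book_titles helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this sketch.
DOCUMENT = """<library>
    <book id="b1"><title>First</title></book>
    <book id="b2"><title>Second</title></book>
</library>"""


def book_titles(xml_text):
    """Map each book's id attribute to the text of its <title> child."""
    root = ET.fromstring(xml_text)  # parse the text into an element tree
    return {book.get("id"): book.findtext("title") for book in root.iter("book")}


if __name__ == "__main__":
    print(book_titles(DOCUMENT))
```

The application never touches angle brackets; it asks the parsed tree for elements, attributes, and text.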

Tags: html, regex, xml

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>
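A small sketch of how this bites line-oriented code, assuming the tag is split across three lines and wrapped in a hypothetical root element so it forms a complete document. The pattern and function names are illustrative only.

```python
import re
import xml.etree.ElementTree as ET

# One tag spread over three lines, inside an assumed <root> wrapper.
MULTILINE = '<root><tag\nattr="5"\n/></root>'

# A pattern written with a single line of input in mind.
TAG_ON_ONE_LINE = re.compile(r'<tag\s+attr="(\d+)"\s*/>')


def attrs_by_line(text):
    """Apply the pattern to each line separately, as grep-style code would."""
    found = []
    for line in text.splitlines():
        found.extend(TAG_ON_ONE_LINE.findall(line))
    return found


def attrs_by_parser(text):
    """Let the XML parser deal with whitespace inside the tag."""
    return [el.get("attr") for el in ET.fromstring(text).iter("tag")]


if __name__ == "__main__":
    print(attrs_by_line(MULTILINE))
    print(attrs_by_parser(MULTILINE))
```

Running the same pattern over the whole text would match here, because \s also matches newlines; the failure is specific to processing the file one line at a time.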

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />
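To make the problem concrete, here is a sketch that compares a naive "a tag starts at <name" pattern with Python's html.parser on the snippet above. NAIVE_TAG, StartTagRecorder, and parsed_start_tags are names chosen for this example.

```python
import re
from html.parser import HTMLParser

SNIPPET = '<img src="imgtag.gif" alt="<img>" />'

# Naive assumption: every "<" followed by a name starts a tag.
NAIVE_TAG = re.compile(r"<(\w+)")


class StartTagRecorder(HTMLParser):
    """Record each start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags also arrive here via the default handle_startendtag.
        self.tags.append((tag, dict(attrs)))


def parsed_start_tags(markup):
    """Return (tag, attributes) pairs as the HTML parser sees them."""
    recorder = StartTagRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.tags


if __name__ == "__main__":
    print(NAIVE_TAG.findall(SNIPPET))
    print(parsed_start_tags(SNIPPET))
```

The regex reports two img tags; the parser reports one, with "<img>" as the value of its alt attribute.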

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span>
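The sketch below shows one way this goes wrong: a non-greedy pattern pairs the outer opening tag with the inner closing tag, while a parser recovers the real nesting. The SPAN pattern and span_depths helper are illustrative assumptions, not a recommended regex.

```python
import re
import xml.etree.ElementTree as ET

NESTED = '<span id="outer"><span id="inner">foo</span></span>'

# Non-greedy match from an opening <span> to the first </span> after it.
SPAN = re.compile(r"<span[^>]*>(.*?)</span>")


def span_depths(markup):
    """Return (id, depth) pairs by walking the parsed element tree."""

    def walk(element, depth):
        yield element.get("id"), depth
        for child in element:
            yield from walk(child, depth + 1)

    return list(walk(ET.fromstring(markup), 0))


if __name__ == "__main__":
    # The captured "content" is cut off at the inner closing tag.
    print(SPAN.findall(NESTED))
    print(span_depths(NESTED))
```

Making the pattern greedy happens to fix this input but then merges sibling spans into one match, so neither variant tracks nesting depth.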

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>) <span class="prefix">348</span>-<span class="linenum">3020</span></span>
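One workable approach, sketched here under the assumption that only the visible text matters, is to let a parser strip the markup first and then run the regex on plain text. PHONE, TextExtractor, and visible_text are names invented for this example.

```python
import re
from html.parser import HTMLParser

MARKED_UP = (
    '<span class="phonenum">(<span class="area code">703</span>) '
    '<span class="prefix">348</span>-<span class="linenum">3020</span></span>'
)

# A simple pattern for numbers written like (703) 348-3020.
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


class TextExtractor(HTMLParser):
    """Collect only the character data between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the text content of the markup with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.chunks)


if __name__ == "__main__":
    print(PHONE.findall(MARKED_UP))                # regex on raw markup
    print(PHONE.findall(visible_text(MARKED_UP)))  # regex on extracted text
```

The regex is still useful here, but only after the parser has turned the document into the text a reader would actually see.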

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>
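A sketch of the damage a commented-out fragment can do to a link-extracting regex, compared with html.parser, which skips the comment. The HREF pattern and the helper names are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

WITH_COMMENT = (
    '<a href="foo">foo</a>\n'
    '<!-- FIXME:\n'
    '    <a href="\n'
    '-->\n'
    '<a href="bar">bar</a>'
)

# Naive link extractor: grab whatever sits between the quotes after href.
HREF = re.compile(r'<a href="([^"]*)"')


class LinkCollector(HTMLParser):
    """Collect href values from real <a> tags; comments are skipped."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


def parsed_links(markup):
    """Return the href of every <a> element outside comments."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(HREF.findall(WITH_COMMENT))
    print(parsed_links(WITH_COMMENT))
```

The unterminated quote inside the comment makes the regex swallow the start of the next real link, so "bar" never appears in its results.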

What other gotchas are you aware of?

797

asked Mar 31 '09 14:03

Chas. Owens

1 Answer

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>
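For comparison, the sketch below runs a naive tag-finding pattern and Python's xml.etree.ElementTree over that document. It assumes the underlying expat parser expands the internal entity, which is its normal behaviour; NAIVE_TAG and parsed_elements are names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

FUN_XML = """<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>"""

# Naive assumption: every "<" followed by a name starts an element.
NAIVE_TAG = re.compile(r"<(\w+)")


def parsed_elements(xml_text):
    """Return (tag, attributes) for each real element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.tag, dict(el.attrib)) for el in root.iter()]


if __name__ == "__main__":
    # Picks up "tags" from the CDATA section and the processing instruction.
    print(NAIVE_TAG.findall(FUN_XML))
    # Only two elements exist: <x> and <a>, with the entity expanded in b.
    print(parsed_elements(FUN_XML))
```

Everything else that looks like a tag is character data, CDATA content, or the body of a processing instruction.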

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">

<HTML/
  <HEAD/
    <TITLE/>/
    <P/>

146

answered Oct 09 '22 18:10

bobince
