Can I parse an HTML file using an XML parser? Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical. The intended use is to make an HTML parser that is part of a web crawler application.


Parsing an HTML document using an XML parser

1 Answer

You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can use the following HTML features, which XML parsers don’t understand:

elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
elements that don’t need end tags; e.g., <p>, <dt>, and <li> (their end tags can be implied)
elements whose content can include unescaped "<" characters; e.g., style, textarea, title, and script, as in <script> if (a < b) … </script> or <title>Using the "<" operator</title>
attributes with unquoted values; e.g., <meta charset=utf-8>
attributes that are empty, with no separate value given at all; e.g., <input disabled>

XML parsers will fail to parse any HTML document that uses any of those features.
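
As a rough illustration of the features listed above, the sketch below feeds one small snippet per feature to Python's standard-library XML parser (`xml.etree.ElementTree`). The snippets and the helper name `xml_parse_ok` are made up for this example; they are not from the original answer.

```python
import xml.etree.ElementTree as ET

# One hypothetical snippet per HTML feature from the list above.
SNIPPETS = {
    "void element": "<div><br></div>",
    "implied end tag": "<ul><li>one<li>two</ul>",
    "unescaped '<' in script": "<script> if (a < b) {} </script>",
    "unquoted attribute value": "<meta charset=utf-8>",
    "empty attribute": "<input disabled>",
}


def xml_parse_ok(markup):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Map each feature name to whether the XML parser accepted its snippet.
results = {name: xml_parse_ok(markup) for name, markup in SNIPPETS.items()}

if __name__ == "__main__":
    for name, ok in results.items():
        print(f"{name}: {'parsed' if ok else 'rejected'}")
```

Each snippet is valid HTML but not well-formed XML, so the XML parser is expected to reject all five, while a control such as `<p>ok</p>` parses normally.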

HTML parsers, on the other hand, will basically never fail no matter what a document contains, because the HTML standard defines how a parser must recover from every kind of malformed input.
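
For contrast, here is a minimal sketch using Python's standard-library `html.parser`. It is a tolerant tokenizer rather than a parser that follows the HTML standard's full algorithm, so treat it only as an illustration of error tolerance. The class name `TagCollector` and the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value, such as "disabled".
        self.tags.append((tag, dict(attrs)))


def collect_tags(markup):
    """Parse markup leniently and return the start tags that were seen."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    sample = "<meta charset=utf-8><input disabled><ul><li>one<li>two</ul>"
    for tag, attrs in collect_tags(sample):
        print(tag, attrs)
```

The sample mixes an unquoted attribute value, an empty attribute, void elements, and `<li>` elements without end tags, and the parser reports all of the start tags without raising an error.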

All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.

Regarding this part of the question: “The intended use is to make an HTML parser that is part of a web crawler application”

If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

parse5 (Node.js/JavaScript)
html5lib (Python)
html5ever (Rust)
validator.nu HTML5 parser (Java)
gumbo (C, with bindings for Ruby, Objective-C, C++, PHP, C#, Perl, Lua, D, Julia…)
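
For the crawler use case, the link-extraction step might look roughly like the sketch below. To stay dependency-free it uses Python's standard-library `html.parser` as a stand-in; that module is tolerant but does not implement the HTML standard's parsing algorithm, so a real crawler would more likely swap in one of the conformant parsers listed above. The names `LinkExtractor` and `extract_links`, and the example URLs, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_text, base_url):
    """Return the links found in html_text, resolved against base_url."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html_text)
    extractor.close()
    return extractor.links


if __name__ == "__main__":
    page = '<p>See <a href=about.html>about<br><a href="https://example.org/x">x</a>'
    print(extract_links(page, "https://example.com/docs/index.html"))
```

The sample page deliberately uses an unquoted attribute value, a void element, and an unclosed `<a>`, which are the kinds of markup a crawler meets on real pages and which an XML parser would reject.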

157

answered Oct 15 '22 03:10

sideshowbarker


Tags: html, parsing, xml, html-parsing

Kent Kostelac
