 

Parsing an html document using an XML-parser

Can I parse an HTML file using an XML parser?

Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.

The intended use is to make an HTML parser that is part of a web crawler application.

Kent Kostelac asked Sep 14 '15 20:09




1 Answer

You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand.

  • elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
  • elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, and script, as in <script> if (a < b) … </script> or <title>Using the "<" operator</title>
  • attributes with unquoted values; for example, <meta charset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g., <input disabled>

XML parsers will fail to parse any HTML document that uses any of those features.

HTML parsers, on the other hand, will basically never fail no matter what a document contains.
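To make the contrast concrete, here is a minimal sketch using only Python's standard library. It feeds one small snippet per feature from the list above to xml.etree.ElementTree and to html.parser. The snippets and the helper names (SNIPPETS, xml_parses, TagCollector, html_tags) are made up for illustration. Note also that html.parser is a lenient tokenizer, not a parser that conforms to the HTML standard, so it only stands in here for "an HTML parser".

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# One illustrative snippet per HTML feature listed above (all hypothetical).
SNIPPETS = {
    "void element": "<p>line one<br>line two</p>",
    "implied end tag": "<ul><li>one<li>two</ul>",
    "unescaped <": "<title>Using the < operator</title>",
    "unquoted attribute": "<meta charset=utf-8>",
    "empty attribute": "<input disabled>",
}


def xml_parses(text):
    """Return True if the stdlib XML parser accepts the text as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records the start tags seen by the lenient stdlib HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    """Return the list of start-tag names found in the text."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


for name, snippet in SNIPPETS.items():
    print(f"{name}: xml ok={xml_parses(snippet)}, html tags={html_tags(snippet)}")
```

Each snippet should be rejected by the XML parser with a ParseError, while the HTML tokenizer still reports the tags it saw. Rewriting a snippet as well-formed XML (for example `<p>ok<br/></p>`) makes the XML parser accept it.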


All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.


“The intended use is to make an HTML parser, that is part of a web crawler application”

If you’re going to create a web-crawler application, you should absolutely use an HTML parser, ideally one that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

  • parse5 (Node.js/JavaScript)
  • html5lib (Python)
  • html5ever (Rust)
  • validator.nu HTML5 parser (Java)
  • gumbo (C, with bindings for Ruby, Objective-C, C++, PHP, C#, Perl, Lua, D, Julia…)
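
As a rough illustration of the crawler use case, the sketch below pulls links out of a page that uses several of the HTML-only features described earlier. To stay self-contained it uses Python's standard-library html.parser, which is lenient but is not a standard-conformant parser like the ones listed above; in a real crawler one of those would be the better choice. The LinkExtractor class, the extract_links helper, the sample PAGE, and the example.com base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without an href, or with an empty one.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return the links found in html_text, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Sample page with unquoted attributes and omitted end tags,
# which an XML parser would reject.
PAGE = """<!doctype html>
<meta charset=utf-8>
<title>Demo</title>
<ul>
  <li><a href=/about>About
  <li><a href="news/today.html">News</a>
  <li><a name=top>no href here</a>
</ul>"""

print(extract_links(PAGE, "https://example.com/docs/"))
```

Resolving each href against the page URL with urljoin gives the crawler absolute URLs it can queue directly, whether the link was written as a root-relative or a document-relative path.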

sideshowbarker answered Oct 15 '22 03:10