
HtmlAgilityPack: What exactly is the effect of setting the HtmlDocument OptionAutoCloseOnEnd to true?

The current documentation says:

Defines if closing for non closed nodes must be done at the end or directly in the document. Setting this to true can actually change how browsers render the page. Default is false.

Sorry, I have to admit I do not understand this paragraph. Specifically, "at the end" of what? And what exactly does "in the document" mean? The second-to-last sentence sounds ominous. If the option is set to true and the HTML is well formed, is this still going to affect the document?

I looked at the source code but did not understand what is happening: the code reacts to the property not being set to true. See HtmlNode.cs and search for OptionAutoCloseOnEnd (line 1707). I also found some funky code in HtmlWeb.cs at lines 1113 and 1154. The source code browser does not show line numbers, so search for OptionAutoCloseOnEnd in the page instead.

Could you please illustrate with an example what this option does?

I am using the HtmlAgilityPack to fix some bad html and to export the page content to xml.

I came across some badly formatted html - overlapping tags. Here is the snippet:

<p>Blah bah
<P><STRONG>Some Text</STRONG><STRONG></p>
<UL>
<LI></STRONG>Item 1.</LI>
<LI>Item 2</LI>
<LI>Item 3</LI></UL>

Note that the first p tag is not closed and note the overlapping STRONG tag.

If I set OptionAutoCloseOnEnd to true, this somehow gets fixed. I am trying to understand what effect setting this property to true has, in general, on the structure of the document.

Here is the C# code that I am using:

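using HtmlAgilityPack;

// htmlText holds the malformed HTML shown above.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;       // produce XML-conformant output
doc.OptionFixNestedTags = true;     // attempt to fix nesting errors
// doc.OptionAutoCloseOnEnd = true; // the option in question, currently disabled
doc.LoadHtml(htmlText);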

Thank you!

costa, asked Nov 03 '16 01:11


2 Answers

The current code always closes the unclosed nodes just before the parent node is closed. So the following code

var doc = new HtmlDocument();
doc.LoadHtml("<x>hello<y>world</x>");
doc.Save(Console.Out);

will output this (the unclosed <y> is closed before the parent <x> is closed)

<x>hello<y>world</y></x>

Originally, the option, when set, was meant to be able to produce this instead (not for XML output types):

<x>hello<y>world</x></y>

with the closing <y> set at the end of the document (that's what the "end" means). Note in this case, you can still get overlapping elements.

This feature (of limited use, I admit) was broken at some point in the past; I don't know why.
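
Because the feature is described as broken, one way to see what a given HtmlAgilityPack version actually does is to serialize the same input with the option off and on, then compare the results. The following is a minimal sketch and not part of the original answer. The class name AutoCloseProbe and the method Render are hypothetical, and no particular output is claimed for either setting.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class AutoCloseProbe
{
    // Loads the HTML with the given OptionAutoCloseOnEnd value and
    // returns what HtmlDocument.Save writes for it.
    public static string Render(string html, bool autoCloseOnEnd)
    {
        var doc = new HtmlDocument();
        doc.OptionAutoCloseOnEnd = autoCloseOnEnd; // set before loading, as in the question
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer); // same call as doc.Save(Console.Out), but captured in a string
            return writer.ToString();
        }
    }

    public static void Main()
    {
        const string html = "<x>hello<y>world</x>";
        string off = Render(html, false);
        string on = Render(html, true);

        Console.WriteLine("false: " + off);
        Console.WriteLine("true:  " + on);
        Console.WriteLine(off == on
            ? "The option made no difference for this input."
            : "The option changed the output for this input.");
    }
}
```

If the two strings are identical for the inputs you care about, the option has no observable effect in your version for those inputs.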

Note that the <p> tag is a special case: by default it is governed by a custom HtmlElementFlag. This is how it is declared in HtmlNode.cs:

ElementsFlags.Add("p", HtmlElementFlag.Empty | HtmlElementFlag.Closed);
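
To check which flags your installed version registers for a tag, the static HtmlNode.ElementsFlags collection can be queried directly. This is a minimal sketch, assuming a version in which ElementsFlags is a generic dictionary keyed by tag name (very old releases may differ). The class name ElementFlagProbe and the method Describe are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class ElementFlagProbe
{
    // Returns the flags registered for a tag name as text,
    // or "(none)" when the tag has no entry in HtmlNode.ElementsFlags.
    public static string Describe(string tagName)
    {
        return HtmlNode.ElementsFlags.TryGetValue(tagName, out HtmlElementFlag flags)
            ? flags.ToString()
            : "(none)";
    }

    public static void Main()
    {
        Console.WriteLine("p:  " + Describe("p"));
        Console.WriteLine("br: " + Describe("br"));
    }
}
```

The collection is static and mutable, so changing or removing an entry affects every document parsed afterwards in the same process.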

Simon Mourier, answered Sep 28 '22 22:09


A better way to use HtmlAgilityPack is to open and close tags programmatically wherever required, and to set:

 doc.OptionAutoCloseOnEnd = false;

This gives you the expected formatting.

Otherwise, the library checks for any tags that are not closed and closes them wherever it deems suitable, depending on your code's execution flow.
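
To find out which tags the parser considered problematic before deciding where to fix things by hand, the ParseErrors collection on HtmlDocument can be listed after loading. This is a minimal sketch and not part of the original answer. The class name ParseErrorReport and the method Collect are hypothetical, and which errors are reported for a given input depends on the library version.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ParseErrorReport
{
    // Loads the HTML and returns one readable line per parse error reported by the library.
    public static List<string> Collect(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            lines.Add($"line {error.Line}, col {error.LinePosition}: {error.Code} ({error.Reason})");
        }
        return lines;
    }

    public static void Main()
    {
        // Shortened version of the malformed snippet from the question.
        const string html = "<p>Blah bah\n<P><STRONG>Some Text</STRONG><STRONG></p>";
        foreach (string line in Collect(html))
        {
            Console.WriteLine(line);
        }
    }
}
```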

Jose Francis, answered Sep 28 '22 21:09