HTML Agility Pack - Issue selecting an HTML select tag with the option tags within

I am using HTML Agility Pack to select an element from a loaded HTML string and return that element together with everything it contains. To test my code, I ran it against the select tag example from w3schools:

<select name="cars">
  <option value="volvo">Volvo XC90</option>
  <option value="saab">Saab 95</option>
  <option value="mercedes">Mercedes SLK</option>
  <option value="audi">Audi TT</option>
</select>

When I try to select and return this with HTML Agility Pack, I get the following (the option closing tags are removed):

<select name="cars">
  <option value="volvo">Volvo XC90
  <option value="saab">Saab 95
  <option value="mercedes">Mercedes SLK
  <option value="audi">Audi TT
</select>

So I did some searching here and found an instruction to add the line: HtmlNode.ElementsFlags.Remove("option");

I did that, and now I get the following (the option text is moved outside of the option tags):

<select name="cars">
  <option value="volvo"></option>Volvo XC90
  <option value="saab"></option>Saab 95
  <option value="mercedes"></option>Mercedes SLK
  <option value="audi"></option>Audi TT
</select>

I would like the output to match the original HTML. What do I need to do to get that?

I was also experimenting with OptionWriteEmptyNodes. When I tested with input tags, their self-closing slash was being removed, and enabling that option seemed to fix it. I have commented it out for now to make sure it is not affecting this issue.

This is my .NET C# code:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(content);
HtmlNode.ElementsFlags.Remove("option"); // otherwise, the closing tag is removed.

//doc.OptionWriteEmptyNodes = true;

var nodes = doc.DocumentNode.SelectNodes("//select");

if (nodes == null)
    return "Not found";
else
    return nodes[0].OuterHtml;

Jon H asked Oct 04 '22 09:10


1 Answer

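You need to set the entry for the option tag in HtmlNode.ElementsFlags to HtmlElementFlag.Closed. Do this before calling LoadHtml, since the flags are consulted while the document is parsed: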

HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

With that flag in place, the select node's OuterHtml should match your original HTML.
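
Putting the question's code and the flag together, a minimal sketch might look like the following. The class and method names (SelectExtractor, FirstSelectOuterHtml) are made up for illustration and are not part of HtmlAgilityPack. It assumes the HtmlAgilityPack NuGet package is referenced, and it relies on HtmlNode.ElementsFlags being a static table, so the change applies to every document parsed afterwards in the same process.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, named here only for illustration.
public static class SelectExtractor
{
    public static string FirstSelectOuterHtml(string html)
    {
        // Set the flag BEFORE parsing. ElementsFlags is static, so this
        // affects all later parses in the process, not just this document.
        HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode select = doc.DocumentNode.SelectSingleNode("//select");
        return select == null ? "Not found" : select.OuterHtml;
    }
}
```

Calling SelectExtractor.FirstSelectOuterHtml(content) replaces the body of the method in the question. Because the flag table is shared, it may be cleaner to set it once at application startup rather than on every call.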

I believe HtmlAgilityPack behaves this way because the closing tag of the <option> element is, ironically, optional in HTML.

From the documentation of the HtmlNode class and its ElementsFlags field:

Gets a collection of flags that define specific behaviors for specific element nodes. The table contains a DictionaryEntry list with the lowercase tag name as the Key, and a combination of HtmlElementFlags as the Value.

A further look at the HtmlElementFlag enum reveals this:

Empty - The node is empty. META or IMG are example of such nodes.
Closed - The node will automatically be closed during parsing.

You can view the source code for the class HtmlNode to see what other tags are considered 'specific'.
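
As an alternative to reading the source, the flags can be checked at runtime. The sketch below assumes the static helpers HtmlNode.IsEmptyElement and HtmlNode.IsClosedElement are available in the HtmlAgilityPack version you use. FlagInspector and Describe are hypothetical names chosen for this example. No output is shown because the defaults differ between library versions.

```csharp
using HtmlAgilityPack;

// Hypothetical helper class, named here only for illustration.
public static class FlagInspector
{
    // Reports whether HtmlAgilityPack currently treats a tag as Empty and/or Closed.
    public static string Describe(string tagName)
    {
        bool isEmpty = HtmlNode.IsEmptyElement(tagName);   // reflects HtmlElementFlag.Empty
        bool isClosed = HtmlNode.IsClosedElement(tagName); // reflects HtmlElementFlag.Closed
        return tagName + ": Empty=" + isEmpty + ", Closed=" + isClosed;
    }
}
```

Calling FlagInspector.Describe("option") before and after changing HtmlNode.ElementsFlags["option"] is a quick way to confirm that the setting took effect.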

Daniel B answered Oct 12 '22 11:10