
Can I prevent AngleSharp from extrapolating a full HTML document when parsing a fragment?

Tags:

anglesharp

Is there any way to get AngleSharp to not create a full HTML document when parsing a fragment? For example, if I parse:

<title>The Title</title>

I get a full HTML document in DocumentElement.OuterHtml:

<html><head><title>The Title</title></head><body></body></html>

If I parse:

<p>The Paragraph</p>

I get another full HTML document:

<html><head></head><body><p>The Paragraph</p></body></html>

Notice that AngleSharp is smart enough to know where my fragment should go. In one case, it puts it in the HEAD tag, and in the other case, it puts it in the BODY tag.

This is clever, but if I just want the fragment back out, I don't know where to find it. I can't just call Body.InnerHtml because, depending on the HTML I parsed, my fragment might be in Head.InnerHtml instead.

Is there a way to get AngleSharp to not create a full document, or is there some other way to get my isolated fragment back out after parsing?

Deane asked Aug 19 '16 16:08


2 Answers

It is possible now. Below is an example copied from https://github.com/AngleSharp/AngleSharp/issues/594

var fragment = "<script>deane</script><div>deane</div>";
var p = new HtmlParser();
var dom = p.Parse("<html><body></body></html>");
var nodes = p.ParseFragment(fragment, dom.Body);

The second parameter of ParseFragment specifies the context element in which the fragment is parsed. In your case you will need to parse the <title> in the context of dom.Head and the <p> in the context of dom.Body.

Oh wow, it is OP's own code that I have just copied.
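
To make the context parameter concrete, here is a minimal sketch that applies it to the two fragments from the question. It assumes the same AngleSharp 0.9.x API as the snippet above (HtmlParser.Parse and ParseFragment in the AngleSharp.Parser.Html namespace); namespaces and method names differ in later releases. ParseInContext is a hypothetical helper written for this illustration, not part of AngleSharp, and it only returns element nodes, so bare text at the top level of a fragment would be dropped.

```csharp
using System;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespace

public static class FragmentDemo
{
    // Hypothetical helper: parses a fragment inside the given context element
    // and returns the concatenated markup of the resulting element nodes.
    static string ParseInContext(HtmlParser parser, string fragment, IElement context)
    {
        var nodes = parser.ParseFragment(fragment, context);
        return string.Concat(nodes.OfType<IElement>().Select(e => e.OuterHtml));
    }

    public static void Main()
    {
        var parser = new HtmlParser();

        // An empty skeleton document whose head and body serve as contexts.
        var dom = parser.Parse("<html><head></head><body></body></html>");

        // Head-level markup is parsed with dom.Head as the context element.
        Console.WriteLine(ParseInContext(parser, "<title>The Title</title>", dom.Head));

        // Body-level markup is parsed with dom.Body as the context element.
        Console.WriteLine(ParseInContext(parser, "<p>The Paragraph</p>", dom.Body));
    }
}
```

Because ParseFragment returns only the parsed nodes rather than a whole document, there is no need to guess afterwards whether the fragment ended up in Head or Body.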

jakubiszon answered Oct 23 '22 08:10


I have learned that this is not possible. AngleSharp is designed to build the DOM exactly as the HTML spec prescribes. If you create an HTML document from the fragments in my question, open it in a browser, and inspect the DOM, you'll find the exact same situation. AngleSharp is in compliance.

What you can do is parse it as XML with errors suppressed, which should cause the document to self-correct dirty HTML issues, and give you a "clean" document which can then be manipulated.

var html = "<x><y><z>foo</y></z></x>";
var options = new XmlParserOptions()
{
    IsSuppressingErrors = true
};
var dom = new XmlParser(options).Parse(html);

There is one problem here: it doesn't handle entities perfectly, meaning it still throws some errors on them even when errors are suppressed. It's on the list to be fixed.

Here's the GitHub issue that led to this answer:

https://github.com/AngleSharp/AngleSharp/issues/398

Deane answered Oct 23 '22 08:10