
How can I load only specific elements in AngleSharp?

I'm using AngleSharp to parse HTML5. At the moment I wrap the elements I want to parse in a little bit of extra HTML to make a valid HTML5 document and then run the parser on that. Is there a better way of doing it, meaning parsing specific elements directly and validating that the structure is indeed HTML5?

Eyal Alon asked Jan 10 '23 01:01


1 Answer

Hm, a little example would be nice. But AngleSharp does support fragment parsing, which sounds like the thing you want. In general fragment parsing is also applied when you set properties like InnerHtml, which transform strings to DOM nodes.
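To illustrate the InnerHtml route, here is a minimal sketch. It assumes the same AngleSharp version as the rest of this answer (namespace AngleSharp.Parser.Html, with HtmlParser.Parse returning a document); the class and method names are hypothetical and only serve the illustration.

```csharp
using AngleSharp.Parser.Html;

static class InnerHtmlDemo
{
    // Hypothetical helper: parses the fragment by assigning it to InnerHtml
    // and returns how many top-level nodes ended up inside <body>.
    public static int CountChildren(string fragment)
    {
        var parser = new HtmlParser();

        // Start from an empty but complete document.
        var document = parser.Parse("<!DOCTYPE html><html><body></body></html>");

        // Setting InnerHtml triggers fragment parsing with <body> as the context.
        document.Body.InnerHtml = fragment;

        return document.Body.ChildNodes.Length;
    }
}
```

Calling InnerHtmlDemo.CountChildren with the div/span source used in the example further down should yield a single child (the div), since the span is nested inside it.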

You can use the ParseFragment method of the HtmlParser class to get a list of nodes contained in the given source code. An example:

using System.Diagnostics; // for Debug.WriteLine
using AngleSharp.Parser.Html;
// ...

var source = "<div><span class=emphasized>Works!</span></div>";
var parser = new HtmlParser();
var nodes = parser.ParseFragment(source, null); // null = no context given

if (nodes.Length == 0)
    Debug.WriteLine("Apparently something bad happened...");

foreach (var node in nodes)
{
    // Examine the node
}

Usually all nodes will be of type IText or IElement. Comments (IComment) are also possible. You will never see IDocument or IDocumentFragment nodes attached to such an INodeList. Since HTML5 parsing is quite robust (the parser recovers from malformed input instead of failing), it is very likely that you will never experience "errors" using this method.
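As a sketch of what examining the nodes could look like, the helper below tells the three node kinds apart with plain type checks. FragmentInspector and Describe are hypothetical names, and the code assumes the same AngleSharp version as the snippet above; which nodes actually appear at the top level may depend on the context element you pass.

```csharp
using System.Collections.Generic;
using AngleSharp.Dom;
using AngleSharp.Parser.Html;

static class FragmentInspector
{
    // Hypothetical helper: returns one short description per top-level node.
    public static List<string> Describe(string source)
    {
        var parser = new HtmlParser();
        var nodes = parser.ParseFragment(source, null); // null = no context given
        var result = new List<string>();

        foreach (var node in nodes)
        {
            var element = node as IElement;
            var text = node as IText;
            var comment = node as IComment;

            if (element != null)
                result.Add("element: " + element.LocalName);
            else if (text != null)
                result.Add("text: " + text.Data);
            else if (comment != null)
                result.Add("comment: " + comment.Data);
            else
                result.Add("other: " + node.NodeType);
        }

        return result;
    }
}
```

For elements you can continue with the usual DOM API from there, e.g. element.ClassName or element.QuerySelectorAll(...).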

What you can do is look for parsing errors. You need to provide an IConfiguration that exposes an event aggregator, which collects such events. The simplest implementation for aggregating only such events (without the possibility of adding or removing multiple handlers) is the following:

using System.Collections.Generic; // for List<T>
using AngleSharp.Events;
// ...

class SimpleEventAggregator : IEventAggregator
{
    readonly List<HtmlParseErrorEvent> _errors = new List<HtmlParseErrorEvent>();

    public void Publish<TEvent>(TEvent data)
    {
        var error = data as HtmlParseErrorEvent;

        if (error != null)
            _errors.Add(error);
    }

    public List<HtmlParseErrorEvent> Errors
    {
        get { return _errors; }
    }

    public void Subscribe<TEvent>(ISubscriber<TEvent> listener) { }

    public void Unsubscribe<TEvent>(ISubscriber<TEvent> listener) { }
}

The simplest way to use the event aggregator with a configuration is to instantiate a new (provided) Configuration. Here is a sample snippet:

using AngleSharp;
// ...

var errorEvents = new SimpleEventAggregator();
var config = new Configuration(events: errorEvents);
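To actually collect errors, the configuration has to reach the parser. The following is a rough sketch of how that could be wired up; it relies on the SimpleEventAggregator class from above and assumes that HtmlParser offers a constructor taking an IConfiguration in this AngleSharp version. ErrorCheck and CountParseErrors are hypothetical names.

```csharp
using AngleSharp;
using AngleSharp.Parser.Html;

static class ErrorCheck
{
    // Hypothetical helper: parses the fragment and reports how many
    // parse error events were published while doing so.
    public static int CountParseErrors(string source)
    {
        var errorEvents = new SimpleEventAggregator();
        var config = new Configuration(events: errorEvents);

        // Assumption: the parser accepts the configuration via its constructor.
        var parser = new HtmlParser(config);
        parser.ParseFragment(source, null); // null = no context given

        return errorEvents.Errors.Count;
    }
}
```

A result of zero would then mean that no fallback had to be applied while parsing the given source.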

Please note: Every error that is reported is an "official" parse error as defined by the W3C specification. These errors do not indicate that the provided code is malicious or unusable, just that something does not follow the spec and that a fallback had to be applied.

Hope this answers your question. If not, then please let me know.

Update: Updated the answer for the latest version of AngleSharp.

Florian Rappl answered Jan 17 '23 20:01