Losing the 'less than' sign in HtmlAgilityPack loadhtml

I recently started experimenting with the HtmlAgilityPack. I am not familiar with all of its options, so I think I am doing something wrong.

I have a string with the following content:

string s = "<span style=\"color: #0000FF;\"><</span>";

You see that in my span I have a 'less than' sign. I process this string with the following code:

HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(s);

But when I do a quick and dirty look in the span like this:

htmlDocument.DocumentNode.ChildNodes[0].InnerHtml

I see that the span is empty.

Which option do I need to set to preserve the 'less than' sign? I already tried this:

htmlDocument.OptionAutoCloseOnEnd = false;
htmlDocument.OptionCheckSyntax = false;
htmlDocument.OptionFixNestedTags = false;

but with no success.

I know it is invalid HTML. I am using this to fix invalid HTML and to HTML-encode the 'less than' signs.

Please point me in the right direction. Thanks in advance.

TurBas asked Mar 24 '11 15:03


3 Answers

The Html Agility Pack detects this as an error and creates an HtmlParseError instance for it. You can read all errors through the ParseErrors property of the HtmlDocument class. So, if you run this code:

    string s = "<span style=\"color: #0000FF;\"><</span>";
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);
    doc.Save(Console.Out);

    Console.WriteLine();
    Console.WriteLine();

    foreach (HtmlParseError err in doc.ParseErrors)
    {
        Console.WriteLine("Error");
        Console.WriteLine(" code=" + err.Code);
        Console.WriteLine(" reason=" + err.Reason);
        Console.WriteLine(" text=" + err.SourceText);
        Console.WriteLine(" line=" + err.Line);
        Console.WriteLine(" pos=" + err.StreamPosition);
        Console.WriteLine(" col=" + err.LinePosition);
    }

It will display this (the corrected text first, then the details of the error):

<span style="color: #0000FF;"></span>

Error
 code=EndTagNotRequired
 reason=End tag </> is not required
 text=<
 line=1
 pos=30
 col=31

So you can try to fix this error yourself, since you have all the required information (line, column, and stream position). Be aware, though, that fixing errors in HTML in general (as opposed to detecting them) is very complex.
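
For the narrow case in the question, a rough sketch of such a fix could look like the following. It assumes that the reported stream position is a zero-based offset into the original string (which matches pos=30 in the output above) and that the character at that offset is the stray '<'. StrayLessThanFixer and EscapeAt are hypothetical names, not part of the Html Agility Pack, and the helper takes plain offsets so it runs without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class StrayLessThanFixer
{
    // Hypothetical helper, not part of the Html Agility Pack.
    // 'positions' are assumed to be zero-based offsets into 'html',
    // for example values taken from HtmlParseError.StreamPosition.
    public static string EscapeAt(string html, IEnumerable<int> positions)
    {
        var sb = new StringBuilder(html);

        // Work from the end so earlier offsets stay valid after each replacement.
        foreach (int pos in positions.Distinct().OrderByDescending(p => p))
        {
            // Only touch offsets that really point at a '<'.
            if (pos >= 0 && pos < sb.Length && sb[pos] == '<')
            {
                sb.Remove(pos, 1).Insert(pos, "&lt;");
            }
        }

        return sb.ToString();
    }
}
```

With the string from the question and the single offset 30, this should give back the span with &lt; inside it, which can then be passed to LoadHtml again. In real code the offsets would come from doc.ParseErrors, probably filtered to the errors whose SourceText is "<"; which errors are safe to treat this way is a judgment call that this sketch does not make for you.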

Simon Mourier answered Oct 21 '22 12:10


As mentioned in another answer, the best solution I found was to pre-parse the HTML to convert orphaned < symbols to their HTML encoded value &lt;.

return Regex.Replace(html, "<(?![^<]+>)", "&lt;");
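
Wrapped in a small helper, the same replacement might look like the sketch below. The pattern is the one from the line above; the class and method names (OrphanLessThanEscaper, Escape) are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OrphanLessThanEscaper
{
    // A '<' is treated as orphaned when it is NOT followed by
    // one or more non-'<' characters and then a '>'.
    private static readonly Regex OrphanLessThan = new Regex("<(?![^<]+>)");

    // Returns the input with every orphaned '<' replaced by "&lt;".
    public static string Escape(string html)
    {
        return OrphanLessThan.Replace(html, "&lt;");
    }
}
```

Calling Escape on the string from the question leaves the span tags alone and turns the lone < into &lt;, so the result can be handed to LoadHtml. Note that this is a heuristic: in text such as "a < b and c > d" the < is followed by a later >, so it looks like a tag to the pattern and is left unescaped.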
James Hulse answered Oct 21 '22 10:10


Fix the markup, because your HTML string is invalid:

string s = "<span style=\"color: #0000FF;\">&lt;</span>";
Daniel Hilgarth answered Oct 21 '22 12:10