Is there any way to parse invalid HTML?

Tags: c#, .net, xml

I need to parse invalid HTML files that contain stray elements (such as BODY) at arbitrary places throughout the file. I tried to parse one as XML, but with no luck, since the file is not well-formed XML either (many elements all over the file have incorrect attributes). HtmlAgilityPack failed to read the file as well: it only reads the file up to the first incorrect element and nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
    <TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
    <TABLE>
        <TR>
            <TD>Test</TD>
        </TR>
        <TR>
            <TD>Test</TD>
            <TD>Test<TD>
        </TR>
            <BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
                <TR>
                    <TD>Test</TD>
                    <TD>Test</TD>
                </TR>
                <TR>
            </BODY>
        <TR>
        <TD><FONT>Test</FONT></TD>
        </TR>
    </TABLE>
</BODY>

I'm trying to parse info from that table.

Jcf asked Oct 10 '11 12:10


1 Answer

Let Internet Explorer do the hard work for you: it will do its best to "repair" the broken tag structure into something it understands (a DOM tree with correct tag pairings etc., much like what a well-formed XML document would give you).

Load the HTML into a WebBrowser control (System.Windows.Forms.WebBrowser, or System.Windows.Controls.WebBrowser if you prefer the WPF libraries); you can then walk through the DOM via its Document property. The DOM will always be a consistent tree, no matter how broken the original source was.

No third-party libraries needed.
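
As a rough illustration of the approach above, here is a minimal sketch using the Windows Forms WebBrowser control. It assumes a console program that references System.Windows.Forms and receives the path of the broken file as its first argument; the class name BrokenHtmlTableReader is made up for the example. How the stray BODY and unclosed TD/TR elements end up being repaired is decided by the browser engine, so the rows printed for the sample file may not match what you expect from the source layout.

```csharp
using System;
using System.IO;
using System.Windows.Forms;

// Hypothetical helper, not part of any library: loads broken HTML into the
// WinForms WebBrowser control and prints the table cells from the repaired DOM.
static class BrokenHtmlTableReader
{
    [STAThread] // the WebBrowser control must run on an STA thread
    static void Main(string[] args)
    {
        // Assumption: args[0] is the path to the invalid HTML file.
        string html = File.ReadAllText(args[0]);

        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;

            // The DOM is only available once loading has finished.
            browser.DocumentCompleted += (sender, e) =>
            {
                // Walk every row the browser kept after repairing the markup.
                foreach (HtmlElement row in browser.Document.GetElementsByTagName("TR"))
                {
                    foreach (HtmlElement cell in row.GetElementsByTagName("TD"))
                    {
                        Console.Write((cell.InnerText ?? string.Empty).Trim() + "\t");
                    }
                    Console.WriteLine();
                }

                // Stop the message loop started by Application.Run() below.
                Application.ExitThread();
            };

            // Setting DocumentText triggers parsing of the (broken) markup.
            browser.DocumentText = html;

            // The control needs a message loop to load the document.
            Application.Run();
        }
    }
}
```

The message loop is there because the control loads documents asynchronously; reading Document right after setting DocumentText would typically see an empty or incomplete DOM. If the file contains nested tables, the inner rows will be visited more than once by this simple walk, so you may want to start from a specific TABLE element instead.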

Matěj Zábský answered Nov 05 '22 06:11
