Is there any way to parse invalid HTML?

Question

I need to parse invalid HTML files that contain stray elements (such as BODY) on arbitrary lines throughout the file. I tried parsing them as XML, but that failed because the files are not well-formed XML either (many elements carry malformed attributes). HtmlAgilityPack did not cope either: it reads the file only up to the first misplaced element and returns nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
    <TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
    <TABLE>
        <TR>
            <TD>Test</TD>
        </TR>
        <TR>
            <TD>Test</TD>
            <TD>Test<TD>
        </TR>
            <BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
                <TR>
                    <TD>Test</TD>
                    <TD>Test</TD>
                </TR>
                <TR>
            </BODY>
        <TR>
        <TD><FONT>Test</FONT></TD>
        </TR>
    </TABLE>
</BODY>

I'm trying to extract the data from that table.

Matěj Zábský · Accepted Answer

Let Internet Explorer do the hard work for you - it will do its best to "repair" the broken tag structure into something it understands (a well-formed DOM tree with correctly paired elements, rather than the raw tag soup).

Load the HTML into a WebBrowser control (System.Windows.Forms.WebBrowser, or System.Windows.Controls.WebBrowser if you prefer the WPF libraries), then walk the DOM through its Document property. The DOM is always a well-formed tree, no matter how broken the original source was, although the parser's repairs may not match the structure the author intended.
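As a rough illustration of that approach, here is a minimal sketch using the Windows Forms control. It assumes a console project that references System.Windows.Forms, runs on Windows, and reads the markup from a hypothetical file named broken.html. The class name is made up for the example, and the rows it prints depend on how the browser engine chooses to repair the markup.

```csharp
using System;
using System.IO;
using System.Windows.Forms;

static class BrokenHtmlTableDump   // hypothetical name, for illustration only
{
    [STAThread]                    // the WebBrowser control requires an STA thread
    static void Main()
    {
        // Assumption: the broken markup lives in a local file called broken.html.
        string html = File.ReadAllText("broken.html");

        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            browser.DocumentCompleted += (sender, e) =>
            {
                // By now the engine has repaired the markup into a DOM tree.
                foreach (HtmlElement row in browser.Document.GetElementsByTagName("TR"))
                {
                    foreach (HtmlElement cell in row.GetElementsByTagName("TD"))
                    {
                        // InnerText can be null for empty cells.
                        Console.Write((cell.InnerText ?? string.Empty).Trim() + "\t");
                    }
                    Console.WriteLine();
                }

                Application.ExitThread();   // stop the message loop started below
            };

            browser.DocumentText = html;    // loading is asynchronous
            Application.Run();              // pump messages until ExitThread is called
        }
    }
}
```

The message loop is needed because the control loads the document asynchronously, so the DOM is only safe to read inside the DocumentCompleted handler.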

No third party libraries needed.

Tags:

c#

.net

xml

Jcf
