
How to read HTML as XML?

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good fit for this.
My problem is that I can't create an XmlDocument from the HTML. Load(string url) didn't work, so I downloaded the HTML to a string using:

    using System.IO;
    using System.Net;

    public static string readHTML(string url)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
        // The using blocks dispose the response and the reader even if reading throws.
        using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
        using (StreamReader sr = new StreamReader(res.GetResponseStream()))
        {
            return sr.ReadToEnd();
        }
    }

When I try to load that string using LoadXml(string xml), I get the exception:

'--' is an unexpected token. The expected token is '>'

How should I read the HTML file into parsable XML?

Ziv asked Mar 29 '11 12:03



1 Answer

HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). That is why LoadXml fails on ordinary markup. The best way is to use an HTML parser to read the HTML. Afterwards you may transform the result to LINQ to XML, or process it directly.
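
As a rough illustration of that approach, here is a minimal sketch that assumes the third-party Html Agility Pack library (the HtmlAgilityPack NuGet package). The answer does not name a specific parser, so this choice is only an assumption. LinkExtractor, ExtractLinks and ToXml are hypothetical names made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack; // assumption: third-party parser, not named in the answer

public static class LinkExtractor
{
    // Hypothetical helper: collects the href value of every <a href="..."> element.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses HTML leniently instead of requiring well-formed XML

        var links = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    // Optional step: wrap the extracted links in a LINQ to XML tree for further querying.
    public static XElement ToXml(IEnumerable<string> links)
    {
        return new XElement("links", links.Select(link => new XElement("link", link)));
    }
}
```

Combined with the readHTML method from the question, LinkExtractor.ExtractLinks(readHTML(url)) should give the link targets as plain strings, and ToXml is only needed if you still want to query the result with LINQ to XML.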

Konrad Rudolph answered Oct 08 '22 14:10