Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read value from HTML node

I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.

I have this HTML file which looks like this:

<body id="s1" style="s1">
    <div xml:lang="uk">
        <p begin="00:00:00" end="00:00:29">
          <span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
        </p>

Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    if (reader.LocalName == "span")
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);
        XmlNode elem = doc.DocumentElement.FirstChild;
        var c = elem.InnerText;
    }
 }

I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:

<body id="s1" style="s1">
  <div xml:lang="uk">
      <p begin="00:00:00" end="00:00:29">I'm great! </p>

In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    XmlDocument doc = new XmlDocument();
    doc.Load(reader);
    XmlNode elem = doc.DocumentElement.FirstChild;
    var c = elem.InnerText;
}

But I get this error: This document already has a 'DocumentElement' node. at line doc.Load(reader). How to read correctly and what's causing the trouble? I am using .NET 2.0

like image 330
nawfal Avatar asked Jun 30 '12 20:06

nawfal


People also ask

What is node value in HTML?

Definition and UsageThe nodeValue property sets or returns the value of a node. If the node is an element node, the nodeValue property will return null.

How do you access text node?

getElementById(id); if (node === null) { throw new TypeError("No element found with the id: " + id); } textArray = getStringsFromChildren(node); if (textArray === null) { return null; // No text nodes found. Example: <element></element> } textArray = filterWhitespaceLines(textArray); if (textArray.

Why is nodeValue null?

The nodeValue property reflects the string value of theNode . Since only two node types (element nodes and text nodes) are supported by ActionScript, nodeValue has only two possible values: If theNode is an element node, nodeValue is null .


2 Answers

It looks like you have HTML that you want to parse with a XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node. exception: because you have more than one root node, which is allowed (or better: tolerated) in HTML, but not XML.

Use an HTML parser instead. Unfortunatelly there is nothing built-in within the .NET framework. You have to take a third party library for that. A very good one is the HTML agility pack, that oleksii already mentioned in his comment.

Edit:

From your comments, I get the feeling your not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:

Relation between SGML, HTML and XML

Neither is XML a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case), you have an HTML document that can be parsed with an XML parser. But be aware if there is some mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is comming to life slowly but steadily...

To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.

like image 51
Philip Daubmeier Avatar answered Oct 18 '22 17:10

Philip Daubmeier


Since you are wanting to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library (COM) for the following code example:

  string html;
  WebClient webClient = new WebClient();
  using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
  using (StreamReader reader = new StreamReader(stream))
  {
    html = reader.ReadToEnd();
  }
  IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
  doc.write(html);
  foreach (IHTMLElement el in doc.all)
    Console.WriteLine(el.tagName);

I have tried loading HTML into XML before, and its all too hard - fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, etc. Since I wanted to then use XSLT against it, after loading into the HTML DOM and navigated through it creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.

like image 43
Michael Avatar answered Oct 18 '22 15:10

Michael