Read value from HTML node

Tags:

I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.

I have this HTML file which looks like this:

<body id="s1" style="s1">
    <div xml:lang="uk">
        <p begin="00:00:00" end="00:00:29">
          <span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
        </p>

Now I need 00:00:00, 00:00:29 and I'm great! from it. I could read it like this:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    if (reader.LocalName == "span")
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(reader);
        XmlNode elem = doc.DocumentElement.FirstChild;
        var c = elem.InnerText;
    }
 }

I get values in variables a, b and c. But there was a slight change in HTML format. Now the HTML looks like this:

<body id="s1" style="s1">
  <div xml:lang="uk">
      <p begin="00:00:00" end="00:00:29">I'm great! </p>

In this scenario how do I parse out 00:00:00, 00:00:29 and I'm great! ? I tried this:

XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Element)
        continue;

    if (reader.LocalName != "p")
        continue;

    var a = reader.GetAttribute(0);
    var b = reader.GetAttribute(1);

    XmlDocument doc = new XmlDocument();
    doc.Load(reader);
    XmlNode elem = doc.DocumentElement.FirstChild;
    var c = elem.InnerText;
}

But I get this error: This document already has a 'DocumentElement' node. at line doc.Load(reader). How to read correctly and what's causing the trouble? I am using .NET 2.0

330

asked Jun 30 '12 20:06

nawfal

2 Answers

It looks like you have HTML that you want to parse with a XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node. exception: because you have more than one root node, which is allowed (or better: tolerated) in HTML, but not XML.

Use an HTML parser instead. Unfortunatelly there is nothing built-in within the .NET framework. You have to take a third party library for that. A very good one is the HTML agility pack, that oleksii already mentioned in his comment.

Edit:

From your comments, I get the feeling your not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:

Relation between SGML, HTML and XML

Neither is XML a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case), you have an HTML document that can be parsed with an XML parser. But be aware if there is some mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is comming to life slowly but steadily...

To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.

answered Oct 18 '22 17:10

Philip Daubmeier

Since you are wanting to parse HTML, you could use WebClient (or WebBrowser) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library (COM) for the following code example:

  string html;
  WebClient webClient = new WebClient();
  using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
  using (StreamReader reader = new StreamReader(stream))
  {
    html = reader.ReadToEnd();
  }
  IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
  doc.write(html);
  foreach (IHTMLElement el in doc.all)
    Console.WriteLine(el.tagName);

I have tried loading HTML into XML before, and its all too hard - fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, etc. Since I wanted to then use XSLT against it, after loading into the HTML DOM and navigated through it creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.

answered Oct 18 '22 15:10

Michael

Related questions
                            
                                How CLR can bypass throwing error when assigning a null value to the struct?
                            
                                Insert a new field into an existing collection (many documents) in MongoDB using C# driver
                            
                                How do I get the iOS Device and Version in Monotouch
                            
                                Open file with association
                            
                                WebDriver and C# - NoSuchElement exception
                            
                                prevent method from executing from different threads at the same time
                            
                                What is the proper way to store a file name in XML?
                            
                                C++ exceptions vs. C# exceptions
                            
                                Greying out a checkbox when another is checked
                            
                                Bind multiple views to multiple viewmodels
                            
                                EF Include always produces INNER JOIN for the first Navigation Property
                            
                                Finding the interface index in C#
                            
                                Invalid attempt to access field before calling read()
                            
                                Concurrent web request performance issues
                            
                                Property Dependency Injection used in Constructor using Unity
                            
                                ComboBox SelectedValue or SelectedItem Binding WPF C#
                            
                                Simple Selecting a row from gridview in asp.net web application
                            
                                Multiple FileStreams for reading/writing different sections of same file concurrently in independent threads
                            
                                How to define a namespace-wide C# alias?
                            
                                Where in code should one keep data that doesn't change?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Read value from HTML node

Tags:

html

c#

html-parsing

nawfal

People also ask

2 Answers

Philip Daubmeier

Michael

Recent Activity

Donate For Us