I'm new to XML/HTML-parsing. Don't even know the right words to do a proper search for duplicates.
I have this HTML file which looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">
<span fontFamily="SchoolHouse Cursive B" fontSize="18">I'm great!</span>
</p>
Now I need 00:00:00
, 00:00:29
and I'm great!
from it. I could read it like this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
if (reader.LocalName == "span")
{
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
}
I get values in variables a
, b
and c
. But there was a slight change in HTML format. Now the HTML looks like this:
<body id="s1" style="s1">
<div xml:lang="uk">
<p begin="00:00:00" end="00:00:29">I'm great! </p>
In this scenario how do I parse out 00:00:00
, 00:00:29
and I'm great!
? I tried this:
XmlTextReader reader = new XmlTextReader(file);
while (reader.Read())
{
if (reader.NodeType != XmlNodeType.Element)
continue;
if (reader.LocalName != "p")
continue;
var a = reader.GetAttribute(0);
var b = reader.GetAttribute(1);
XmlDocument doc = new XmlDocument();
doc.Load(reader);
XmlNode elem = doc.DocumentElement.FirstChild;
var c = elem.InnerText;
}
But I get this error: This document already has a 'DocumentElement' node.
at line doc.Load(reader)
. How to read correctly and what's causing the trouble? I am using .NET 2.0
Definition and UsageThe nodeValue property sets or returns the value of a node. If the node is an element node, the nodeValue property will return null.
getElementById(id); if (node === null) { throw new TypeError("No element found with the id: " + id); } textArray = getStringsFromChildren(node); if (textArray === null) { return null; // No text nodes found. Example: <element></element> } textArray = filterWhitespaceLines(textArray); if (textArray.
The nodeValue property reflects the string value of theNode . Since only two node types (element nodes and text nodes) are supported by ActionScript, nodeValue has only two possible values: If theNode is an element node, nodeValue is null .
It looks like you have HTML that you want to parse with a XML parser. That may also be the reason why you get the This document already has a 'DocumentElement' node.
exception: because you have more than one root node, which is allowed (or better: tolerated) in HTML, but not XML.
Use an HTML parser instead. Unfortunatelly there is nothing built-in within the .NET framework. You have to take a third party library for that. A very good one is the HTML agility pack, that oleksii already mentioned in his comment.
Edit:
From your comments, I get the feeling your not familiar with the fact that there is no direct relation between HTML and XML. The graphic taken from here illustrates this quite well:
Neither is XML a subset of HTML, nor the other way around. Only if you have strict XHTML (rarely the case), you have an HTML document that can be parsed with an XML parser. But be aware if there is some mistake in the code of such an XHTML document, the parser will fail, while a common browser will continue to display the page. Also, the future of XHTML is quite unclear, now that HTML5 is comming to life slowly but steadily...
To sum up: To avoid all those pitfalls, take the easy road and go for an HTML parser.
Since you are wanting to parse HTML, you could use WebClient
(or WebBrowser
) to load the page and then use the HTML DOM to navigate through it. You need to add a reference to Microsoft HTML Object Library
(COM) for the following code example:
string html;
WebClient webClient = new WebClient();
using (Stream stream = webClient.OpenRead(new Uri("http://www.google.com")))
using (StreamReader reader = new StreamReader(stream))
{
html = reader.ReadToEnd();
}
IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
doc.write(html);
foreach (IHTMLElement el in doc.all)
Console.WriteLine(el.tagName);
I have tried loading HTML into XML before, and its all too hard - fixing up unclosed tags (like <BR>), putting quotes around attributes, giving attributes without values a value, etc. Since I wanted to then use XSLT against it, after loading into the HTML DOM and navigated through it creating the relevant XML node for each HTML node. Then I had a proper XML representation of the HTML.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With