I have a project where I am taking some particularly ugly "live" HTML and forcing it into a formal XML DOM with the HTML Agility Pack. What I would like to do next is query over this with LINQ to XML so that I can scrape out the bits I need. I'm using the method described here to parse the HtmlDocument into an XDocument, but when I try to query over the result I'm not sure how to handle namespaces. In one particular document the original HTML was actually poorly formatted XHTML with the following tag:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
When trying to query from this document it seems that the namespace attribute is preventing me from doing something like:
var x = xDoc.Descendants("div");
// returns an empty sequence (no matches)
Apparently for those "div" tags only the LocalName is "div", but the proper tag name is the namespace plus "div". I have tried to do some research on the issue of XML namespaces and it seems that I can bypass the namespace by querying this way:
var x =
(from el in xDoc.Descendants()
where el.Name.LocalName == "div"
select el);
// works
However, this seems like a rather hacky solution and does not properly address the namespace issue. As I understand it, a proper XML document can contain multiple namespaces, so the proper way to handle it should be to parse out the namespaces I'm querying under. Has anyone else ever had to do this? Am I just making it way too complicated? I know that I could avoid all this by sticking with HtmlDocument and querying with XPath, but I would rather stick to what I know (LINQ) if possible, and I would also prefer to know that I am not setting myself up for further namespace-related issues down the road.
What is the proper way to deal with namespaces in this situation?
Using LocalName should be okay. I wouldn't consider it a hack at all if you don't care what namespace it's in.
If you know the namespace you want and you want to specify it, you can:
var ns = "{http://www.w3.org/1999/xhtml}";
var x = xDoc.Root.Descendants(ns + "div");
(MSDN reference)
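The same query can also be written with XNamespace rather than a braced string, which is how it is usually spelled in LINQ to XML. The following is a minimal sketch, not a definitive implementation: the inline sample document is made up for illustration and stands in for the converted XHTML, and the code is written as top-level statements (C# 9 or later), so on older compilers it would need to be wrapped in a Main method.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample document standing in for the converted XHTML.
var xDoc = XDocument.Parse(
    "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">" +
    "<body><div>first</div><p><div>second</div></p></body></html>");

// XNamespace converts implicitly from a string (no braces needed here).
XNamespace xhtml = "http://www.w3.org/1999/xhtml";

// XNamespace + string yields an XName qualified with that namespace.
var divs = xDoc.Root.Descendants(xhtml + "div").ToList();

// A bare "div" means "div in no namespace", so it matches nothing here.
var unqualified = xDoc.Descendants("div").ToList();

Console.WriteLine(divs.Count);        // 2
Console.WriteLine(unqualified.Count); // 0
```

The second query shows why the unqualified lookup in the question comes back empty: the elements inherit the default namespace declared on the html tag, so their full name is the namespace plus "div".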
You can also get a list of all the namespaces used in the document:
var namespaces = (from x in xDoc.Root.DescendantsAndSelf()
select x.Name.Namespace).Distinct();
I suppose you could use that to do this but it's not really any less of a hack:
var x = namespaces.SelectMany(ns=>xDoc.Root.Descendants(ns+"div"));
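If the whole document lives in a single namespace, as in the XHTML case from the question, another option is to read the namespace off the root element instead of hard-coding it. This is only a sketch under that assumption: the sample markup is invented for illustration, and it is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample document; assumes every element shares the root's namespace.
var xDoc = XDocument.Parse(
    "<html xmlns=\"http://www.w3.org/1999/xhtml\">" +
    "<body><div>a</div><div>b</div></body></html>");

// Take the namespace from the root element's own name.
XNamespace ns = xDoc.Root.Name.Namespace;

// Qualify the local name with whatever namespace the document uses.
var divTexts = xDoc.Root.Descendants(ns + "div")
                        .Select(d => d.Value)
                        .ToList();

Console.WriteLine(string.Join(",", divTexts)); // a,b
```

When the document declares no namespace at all, the root's namespace is XNamespace.None, and ns + "div" is then just the unqualified name "div", so the same query works for both kinds of input.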