I'm pulling the source of a website. I then want to extract a specific part of it. My intention is to do this with LINQ-to-XML.
However, I get errors when I parse the source:
XElement source = XElement.Load(reader);
The problem seems to be references to namespaces I don't have. I get the error: 'addthis' is an undeclared prefix. Line 130, position 51.
due to this line:
<div class="addthis_toolbox addthis_pill_combo" addthis:url="http://www.foo.com/foo">
And if I delete that one, others occur.
Thing is, I only care about one piece of this XML file - I don't need to be able to parse the whole file. I just want it in an XElement so I can find that one piece of it. Is there a way for me to hack around the parsing error? And I need a generic solution - I want to parse the file regardless of ANY undeclared prefix errors.
Thanks
This XML is not valid.
In order to use a namespace prefix (such as addthis:), the namespace must be declared, by writing xmlns:addthis="some URI".
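To illustrate the point about declarations, here is a minimal sketch of the same div parsing cleanly once the prefix is declared. The URI urn:example:addthis and the class name DeclaredPrefixDemo are placeholders made up for this example; the real page declares nothing, which is exactly why it fails.

```csharp
using System;
using System.Xml.Linq;

public static class DeclaredPrefixDemo
{
    // Placeholder namespace URI, assumed purely for illustration.
    private const string AddThisUri = "urn:example:addthis";

    public static string ReadUrl()
    {
        // Same element as in the question, plus an xmlns:addthis declaration.
        string markup =
            "<div xmlns:addthis=\"" + AddThisUri + "\" " +
            "class=\"addthis_toolbox addthis_pill_combo\" " +
            "addthis:url=\"http://www.foo.com/foo\" />";

        XElement div = XElement.Parse(markup);

        // A prefixed attribute is looked up by namespace + local name.
        XNamespace addthis = AddThisUri;
        return (string)div.Attribute(addthis + "url");
    }
}
```

Since you don't control the page, you can't add that declaration at the source, so this only shows what the parser is asking for.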
In general, you shouldn't parse HTML using an XML parser, since HTML is likely to be invalid XML, for this reason and a number of other reasons (undeclared entities, unescaped JS, unclosed tags).
Instead, use HTML Agility Pack.
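A rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class name AgilityPackDemo and the XPath are illustrative only; adjust the query to whatever piece of the page you are after.

```csharp
using HtmlAgilityPack;

public static class AgilityPackDemo
{
    // Returns the addthis:url attribute of the first matching div,
    // or null when no such div exists in the page.
    public static string ReadAddThisUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of markup that is not well-formed XML

        HtmlNode div = doc.DocumentNode.SelectSingleNode(
            "//div[contains(@class, 'addthis_toolbox')]");

        // The attribute name is treated as a plain string here,
        // so no namespace declaration is involved.
        return div?.GetAttributeValue("addthis:url", "");
    }
}
```

Because the library is built for real-world HTML, this route should also cope with the unclosed tags and undeclared entities mentioned above, not just undeclared prefixes.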
If you need to do it all in code, what you want is something like this:
// text holds the markup to parse.
XmlReaderSettings settings = new XmlReaderSettings { NameTable = new NameTable() };
// Pre-declare the prefix so the reader can resolve it.
XmlNamespaceManager xmlns = new XmlNamespaceManager(settings.NameTable);
xmlns.AddNamespace("addthis", "");
XmlParserContext context = new XmlParserContext(null, xmlns, "", XmlSpace.Default);
XmlReader reader = XmlReader.Create(new StringReader(text), settings, context);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
And for any additional prefixes add more of these:
xmlns.AddNamespace("prefix", "");
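If you can't list the prefixes ahead of time, one possible generalization is a namespace manager that answers for every prefix. The sketch below is an assumption on my part rather than a documented recipe: LenientNamespaceManager, LenientXml and the urn:undeclared-prefix URI are names invented here, and it relies on the reader consulting the manager passed in through XmlParserContext. It only helps with undeclared prefixes; the markup still has to be well-formed otherwise.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: maps any prefix the document never declared
// to one placeholder namespace instead of letting the lookup fail.
public sealed class LenientNamespaceManager : XmlNamespaceManager
{
    // Placeholder URI, assumed for illustration.
    public const string FallbackUri = "urn:undeclared-prefix";

    public LenientNamespaceManager(XmlNameTable nameTable) : base(nameTable) { }

    public override string LookupNamespace(string prefix)
    {
        // The base class returns null for a prefix it does not know.
        return base.LookupNamespace(prefix) ?? FallbackUri;
    }
}

public static class LenientXml
{
    public static XElement Parse(string text)
    {
        // The settings and the manager must share one name table.
        var settings = new XmlReaderSettings { NameTable = new NameTable() };
        var xmlns = new LenientNamespaceManager(settings.NameTable);
        var context = new XmlParserContext(null, xmlns, "", XmlSpace.Default);

        using (XmlReader reader = XmlReader.Create(new StringReader(text), settings, context))
        {
            return XElement.Load(reader);
        }
    }
}
```

With this in place, an attribute such as addthis:url would be found under XName.Get("url", LenientNamespaceManager.FallbackUri), and the result is an XElement as the question asks for.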