Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parsing html -> xml and querying with Xpath

I want to parse a html page to get some data. First, I convert it to XML document using SgmlReader. Then, I load the result to XMLDocument and then navigate through XPath:

//contains html document
var loadedFile = LoadWebPage();

...

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;

sgmlReader.InputStream = new StringReader(loadedFile);

XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

This code works fine for most cases, except on this site - www.arrow.com (try to search something like OP295GS). I can get a table with result using the following XPath:

var node = doc.SelectSingleNode(".//*[@id='results-table']");

This gives me a node with several child nodes:

[0]         {Element, Name="thead"}  
[1]         {Element, Name="tbody"}  
[2]         {Element, Name="tbody"}  
FirstChild   {Element, Name="thead"}

Ok, let's try to get some child nodes using XPath. But this doesn't work:

var childNodes = node.SelectNodes("tbody");
//childnodes.Count = 0

This also:

var childNode = node.SelectSingleNode("thead");
// childNode = null

And even this:

var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead")

What can be wrong in Xpath queries?


I've just tried to parse that HTML page with Html Agility Pack and my XPath queries work good. But my application use XmlDocument inside, Html Agility Pack doesn't suit me.


I even tried the following trick with Html Agility Pack, but Xpath queries doesn't work also:

//let's parse and convert HTML document using HTML Agility Pack and then load
//the result to XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));

XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);

Perhaps, web page contains errors (not all tags are closed and so on), but in spite of this I can see child nodes (through Quick Watch in Visual Studio), but cannot access them through XPath.


My XPath queries works correctly in Firefox + FirePath + XPather plugins, but don't work in .net XmlDocument :(

like image 469
mlurker Avatar asked Mar 19 '11 03:03

mlurker


1 Answers

I have not used SqmlReader, but every time I have seen this problem it has been due to namespaces. A quick look at the HTML on www.arrow.com shows that this node has a namespace (note the xmlns:javaurlencoder):

<form name="CatSearchForm" method="post" action="http://components.arrow.com/part/search/OP295GS" xmlns:javaurlencoder="java.net.URLEncoder">

This code is how I loop through all nodes in a document to see which ones have namespaces and which don't. If the node you are looking for or any of its parents have namespaces, you must create a XmlNamespaceManager and pass it along with your call to SelectNodes().

This is kind of annoying, so another idea might be to strip all the xmlns: attributes out of the XML before loading it into a XmlDocument. Then, you won't need to fool with XmlNamespaceManager!

XmlDocument doc = new XmlDocument();
doc.Load(@"C:\temp\X.loadtest.xml");

Dictionary<string, string> namespaces = new Dictionary<string, string>();
XmlNodeList nlAllNodes = doc.SelectNodes("//*");
foreach (XmlNode n in nlAllNodes)
{
    if (n.NodeType != XmlNodeType.Element) continue;

    if (!String.IsNullOrEmpty(n.NamespaceURI) && !namespaces.ContainsKey(n.Name))
    {
        namespaces.Add(n.Name, n.NamespaceURI);
    }
}

// Inspect the namespaces dictionary to write the code below

XmlNamespaceManager nMgr = new XmlNamespaceManager(doc.NameTable);
// Sometimes this works
nMgr.AddNamespace("ns1", doc.DocumentElement.NamespaceURI); 
// You can make the first param whatever you want, it just must match in XPath queries
nMgr.AddNamespace("javaurlencoder", "java.net.URLEncoder"); 

XmlNodeList iter = doc.SelectNodes("//ns1:TestProfile", nMgr);
foreach (XmlNode n in iter)
{
    // Do stuff
}
like image 112
Matt Faus Avatar answered Sep 27 '22 17:09

Matt Faus