I'm writing a simple screen-scraping program in C#, for which I need to select all inputs inside a single form with the id "aspnetForm" (there are two forms on the page, and I don't want the inputs from the other one). The inputs in this form sit inside various tables and divs, or directly at the first child level of the form.
So I wrote a really simple XPath query:
//form[@id='aspnetForm']//input
It works as expected in every browser I tested (Chrome, IE, Firefox): it returns what I want.
But in HtmlAgilityPack it doesn't work at all: SelectNodes always returns null.
The queries below, which I wrote for testing, work fine but don't return what I want. The first selects only the inputs that are direct children of my form, and the second just returns the form itself:
//form[@id='aspnetForm']/input
//form[@id='aspnetForm']
Yes, I know that I can enumerate over the nodes from the last query, or run another SelectNodes on its result, but I'd rather not. I want to use the same query as in the browsers.
Is XPath currently broken in HtmlAgilityPack? Are there any alternative XPath implementations for C#?
UPDATE: Test code:
using HtmlAgilityPack;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace HtmlAGPTests
{
    [TestClass]
    public class XPathTests
    {
        private const string html =
            "<form id=\"aspnetForm\">" +
                "<input name=\"first\" value=\"first\" />" +
                "<div>" +
                    "<input name=\"second\" value=\"second\" />" +
                "</div>" +
            "</form>";

        private static HtmlNode GetHtmlDocumentNode()
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);
            return document.DocumentNode;
        }

        [TestMethod]
        public void TwoLevelXpathTest() // fail - nodes is NULL actually.
        {
            var query = "//form[@id='aspnetForm']//input"; // what i want
            var documentNode = GetHtmlDocumentNode();
            var inputNodes = documentNode.SelectNodes(query);
            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void TwoSingleLevelXpathsTest() // works
        {
            var formQuery = "//form[@id='aspnetForm']";
            var inputQuery = "//input";
            var documentNode = GetHtmlDocumentNode();
            var formNode = documentNode.SelectSingleNode(formQuery);
            var inputNodes = formNode.SelectNodes(inputQuery);
            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void SingleLevelXpathTest() // works
        {
            var query = "//form[@id='aspnetForm']";
            var documentNode = GetHtmlDocumentNode();
            var formNode = documentNode.SelectSingleNode(query);
            Assert.IsNotNull(formNode);
        }
    }
}
The unexpected behavior in your test occurs because the HTML contains a <form> element. Here is a related discussion:
Ariman: "I've found that after parsing any node does not have any child nodes. All nodes that should be inside the form (, , etc.) are created as its siblings rather than children."
VikciaR: "In Html specification form tag can overlap, so Htmlagilitypack handle this node a little different..."
[CodePlex discussion: No child nodes for FORM objects]
As VikciaR suggests there, try modifying your test initialization like this:
private static HtmlNode GetHtmlDocumentNode()
{
    // Execute this line once, before parsing: ElementsFlags is static
    // and is consulted while the HTML is being parsed.
    HtmlNode.ElementsFlags.Remove("form");
    var document = new HtmlDocument();
    document.LoadHtml(html);
    return document.DocumentNode;
}
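Since SelectNodes returns null rather than an empty collection when nothing matches, it may help to guard the result before using it. The following is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name FormInputsDemo and the sample markup are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

public static class FormInputsDemo
{
    // Illustrative markup mirroring the test HTML from the question.
    private const string Html =
        "<form id=\"aspnetForm\">" +
            "<input name=\"first\" value=\"first\" />" +
            "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>";

    public static void Main()
    {
        // Drop the special <form> handling before the document is parsed.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(Html);

        HtmlNodeCollection inputs =
            document.DocumentNode.SelectNodes("//form[@id='aspnetForm']//input");

        // SelectNodes yields null, not an empty collection, when nothing matches.
        if (inputs == null)
        {
            Console.WriteLine("No inputs matched.");
            return;
        }

        foreach (HtmlNode input in inputs)
        {
            Console.WriteLine(input.GetAttributeValue("name", "(unnamed)"));
        }
    }
}
```

The default flags for form may differ between HtmlAgilityPack versions, so treat the Remove call as something to verify against the version you actually use.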
Side note: the inputQuery value in the test method TwoSingleLevelXpathsTest() should be .//input. Notice the dot (.) at the beginning, which makes the query relative to the current node. Without it, the query searches from the document root and ignores the preceding formQuery (without the dot, you can change formQuery to anything, as long as it doesn't return null, and inputQuery will always return the same result).
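To see the difference between the absolute and the relative form in isolation, here is a small sketch. It deliberately uses System.Xml from the base class library instead of HtmlAgilityPack, so it runs without extra packages; the XPath semantics shown (a leading // starts at the document root, even when called on a child node) are the same. The class name, the otherForm id, and the markup are invented for the example:

```csharp
using System;
using System.Xml;

public static class RelativeXPathDemo
{
    // Two forms, so a document-wide query and a form-relative query
    // return different counts.
    private const string Markup =
        "<root>" +
            "<form id=\"aspnetForm\">" +
                "<input name=\"first\" />" +
                "<div><input name=\"second\" /></div>" +
            "</form>" +
            "<form id=\"otherForm\"><input name=\"third\" /></form>" +
        "</root>";

    // Evaluates inputQuery with the aspnetForm element as the context node
    // and returns how many nodes it matched.
    public static int CountFromForm(string inputQuery)
    {
        var document = new XmlDocument();
        document.LoadXml(Markup);
        XmlNode form = document.SelectSingleNode("//form[@id='aspnetForm']");
        return form.SelectNodes(inputQuery).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountFromForm("//input"));   // 3: searched from the document root
        Console.WriteLine(CountFromForm(".//input"));  // 2: only descendants of the form
    }
}
```

With a single form on the page both queries happen to return the same nodes, which is why the test in the question passed even without the dot.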