
XPath selects in HtmlAgilityPack don't work as expected

I'm writing a simple screen-scraping program in C#, for which I need to select all inputs inside one particular form, "aspnetForm". (There are two forms on the page, and I don't want the inputs from the other one.) The inputs in this form are nested inside various tables and divs, or sit directly at the first child level of the form.

So I wrote a really simple XPath query:

//form[@id='aspnetForm']//input

It works as expected in every browser I tested (Chrome, IE, Firefox): it returns what I want.

But in HtmlAgilityPack it doesn't work at all: SelectNodes always returns null.

The following queries, which I wrote as tests, work fine but don't return what I want. The first selects only the inputs that are direct children of my form, and the second just returns the form itself:

//form[@id='aspnetForm']/input
//form[@id='aspnetForm']

Yes, I know I can enumerate the nodes from the last query, or run another SelectNodes on its result, but I'd rather not. I want to use the same query as in the browsers.

Is XPath currently broken in HtmlAgilityPack? Are there any alternative XPath implementations for C#?

UPDATE: Test code:

using HtmlAgilityPack;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace HtmlAGPTests
{
    [TestClass]
    public class XPathTests
    {
        private const string html =
                "<form id=\"aspnetForm\">" +
                "<input name=\"first\" value=\"first\" />" +
                "<div>" +
                    "<input name=\"second\" value=\"second\" />" +
                "</div>" +
                "</form>";

        private static HtmlNode GetHtmlDocumentNode()
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);
            return document.DocumentNode;
        }

        [TestMethod]
        public void TwoLevelXpathTest()     // fails: inputNodes is actually null
        {
            var query = "//form[@id='aspnetForm']//input";  // what I want
            var documentNode = GetHtmlDocumentNode();

            var inputNodes = documentNode.SelectNodes(query);

            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void TwoSingleLevelXpathsTest()     // works
        {
            var formQuery = "//form[@id='aspnetForm']";
            var inputQuery = "//input";
            var documentNode = GetHtmlDocumentNode();

            var formNode = documentNode.SelectSingleNode(formQuery);
            var inputNodes = formNode.SelectNodes(inputQuery);

            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void SingleLevelXpathTest()     // works
        {
            var query = "//form[@id='aspnetForm']";
            var documentNode = GetHtmlDocumentNode();

            var formNode = documentNode.SelectSingleNode(query);

            Assert.IsNotNull(formNode);
        }

    }
}
rufanov asked Sep 30 '22 23:09


1 Answer

The unexpected behavior in your test occurs because the HTML contains a <form> element. Here is a related discussion:

Ariman : "I've found that after parsing any node does not have any child nodes. All nodes that should be inside the form (, , etc.) are created as its siblings rather than children."

VikciaR : "In Html specification form tag can overlap, so Htmlagilitypack handle this node a little different..."

[CodePlex discussion : No child nodes for FORM objects ]
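To check whether your HtmlAgilityPack version parses the form this way, you can dump what ends up under the form and what follows it. This is only a minimal diagnostic sketch: the class and method names (FormTreeDump, Describe) are made up for illustration, and the output is deliberately not shown because it depends on the library version and on HtmlNode.ElementsFlags.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper (not part of HtmlAgilityPack): describes how the
// parser arranged the nodes in and after the form.
public static class FormTreeDump
{
    public const string SampleHtml =
        "<form id=\"aspnetForm\">" +
        "<input name=\"first\" value=\"first\" />" +
        "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>";

    public static string Describe(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var form = document.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
        if (form == null)
        {
            return "form not found";
        }

        var report = new StringBuilder();
        report.AppendLine("children of form: " + form.ChildNodes.Count);

        // Anything listed here was parsed as a sibling of the form
        // rather than as one of its children.
        for (var node = form.NextSibling; node != null; node = node.NextSibling)
        {
            report.AppendLine("sibling after form: " + node.Name);
        }
        return report.ToString();
    }

    public static void Main()
    {
        Console.Write(Describe(SampleHtml));
    }
}
```

If the inputs show up as siblings after the form, the //form[@id='aspnetForm']//input query has nothing to match, which would explain the null result.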

As VikciaR suggests there, try changing your test initialization like this:

private static HtmlNode GetHtmlDocumentNode()
{
    // Execute this line once, before parsing: ElementsFlags is static,
    // and it only affects documents loaded after the change.
    HtmlNode.ElementsFlags.Remove("form");

    var document = new HtmlDocument();
    document.LoadHtml(html);

    return document.DocumentNode;
}

Side note: the inputQuery value in the test method TwoSingleLevelXpathsTest() should be .//input. The leading dot (.) makes the query relative to the current node. Without it, the query searches from the document root and ignores the preceding formQuery: you could change formQuery to anything that doesn't return null, and inputQuery would still return the same result.
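The difference is easier to see on a page with two forms. The following is a sketch, not a definitive implementation: the FormInputs class, its InputNames method and the second form's markup are invented for illustration, and the ElementsFlags call is placed before LoadHtml on the assumption that the flag is read while the document is parsed.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper (not part of HtmlAgilityPack): runs an input query
// against the aspnetForm node and returns the matched "name" attributes.
public static class FormInputs
{
    public const string SampleHtml =
        "<form id=\"aspnetForm\">" +
        "<input name=\"first\" value=\"first\" />" +
        "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>" +
        "<form id=\"otherForm\">" +
        "<input name=\"third\" value=\"third\" />" +
        "</form>";

    public static string[] InputNames(string html, string inputQuery)
    {
        // Assumption: must run before parsing. Removing a missing key is a no-op.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var form = document.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var inputs = form?.SelectNodes(inputQuery);
        if (inputs == null)
        {
            return Array.Empty<string>();
        }
        return inputs.Select(node => node.GetAttributeValue("name", "")).ToArray();
    }

    public static void Main()
    {
        // Relative query: only inputs inside aspnetForm.
        Console.WriteLine(string.Join(", ", InputNames(SampleHtml, ".//input")));
        // Absolute query: starts at the document root, so the form node is ignored.
        Console.WriteLine(string.Join(", ", InputNames(SampleHtml, "//input")));
    }
}
```

With the relative query, only the inputs inside aspnetForm should come back; with //input, the input from the other form is included as well.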

har07 answered Oct 11 '22 18:10