
WebDriver can find element using xpath, Html Agility Pack cannot

I have continually had problems with Html Agility Pack; my XPath queries only ever work when they are extremely simple:

//*[@id='some_id']

or

//input

However, whenever they get more complicated, Html Agility Pack can't handle them. Here's an example demonstrating the problem: I'm using WebDriver to navigate to Google and return the page source, which is then passed to Html Agility Pack. Both WebDriver and Html Agility Pack attempt to locate the element/node (C#):

//The XPath query
const string xpath = "//form//tr[1]/td[1]//input[@name='q']";

//Navigate to Google and get page source
var driver = new FirefoxDriver(new FirefoxProfile()) { Url = "http://www.google.com" };
Thread.Sleep(2000);

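//Can WebDriver find it?
//Note: FindElementByXPath throws NoSuchElementException instead of returning null when nothing matches
var e = driver.FindElementByXPath(xpath);
Console.WriteLine(e != null ? "Webdriver success" : "Webdriver failure");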

//Can Html Agility Pack find it?
var source = driver.PageSource;
var htmlDoc = new HtmlDocument { OptionFixNestedTags = true };
htmlDoc.LoadHtml(source);
var nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
Console.WriteLine(nodes!=null ? "Html Agility Pack success" : "Html Agility Pack failure");

driver.Quit();

In this case, WebDriver successfully located the item, but Html Agility Pack did not.

I know that in this case it's very easy to change the XPath to one that will work, such as //input[@name='q'], but that only fixes this specific example, which isn't the point. I need something that exactly, or at least closely, mirrors the behavior of WebDriver's XPath engine, or even that of the FirePath or FireFinder add-ons for Firefox.

If WebDriver can find it, then why can't Html Agility Pack find it too?

Anders asked Jan 20 '23 16:01


1 Answer

The issue you're running into is with the FORM element. Html Agility Pack handles that element differently: by default, it will never report that it has children.

In the particular example you gave, this query does find the target element:

.//div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input

However, this does not, so it's clear the form element is tripping up the parser:

.//form/div/div[2]/table/tr/td/table/tr/td/div/table/tr/td/div/div[2]/input

That behavior is configurable, though. If you place this line prior to parsing the HTML, the form will give you child nodes:

HtmlNode.ElementsFlags.Remove("form");
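
To show where that line fits relative to parsing, here is a minimal sketch. The class name FormChildrenSketch, the helper CountMatches, and the sample HTML are hypothetical and only for illustration. It assumes the HtmlAgilityPack package is referenced, and the default handling of form may differ between library versions.

```csharp
using System;
using HtmlAgilityPack;

public static class FormChildrenSketch
{
    // Hypothetical helper: counts the nodes an XPath query matches in an HTML string.
    public static int CountMatches(string html, string xpath)
    {
        // Remove the special-case entry for <form> before any parsing happens.
        // ElementsFlags is static, so this affects every HtmlDocument in the process.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null rather than an empty collection when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Sample markup (an assumption, not Google's real page source).
        const string html =
            "<html><body><form><div><input name='q'></div></form></body></html>";

        Console.WriteLine(CountMatches(html, "//form//input[@name='q']"));
    }
}
```

Because the flag table is shared by all documents, the removal only needs to happen once, for example at application startup, rather than before every LoadHtml call.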

hemp answered Jan 22 '23 04:01