
Trouble using Xpath "starts with" to parse xhtml

Tags: xpath, yql

I'm trying to parse a webpage to get posts from a forum.
Each message starts with a div in the following format:

<div id="post_message_somenumber">

and I only want to get the first one.

I tried xpath='//div[starts-with(@id, '"post_message_')]' in YQL without success.
I'm still learning this. Does anyone have suggestions?

asked Feb 01 '11 03:02 by bigbucky



2 Answers

I think I have a solution that does not require dealing with namespaces.

Here is one that selects all matching divs:

//div[@id[starts-with(.,"post_message")]]

But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:

(//div[@id[starts-with(.,"post_message")]])[1]

These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.

It works great for me in PowerShell:

# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'

# Run the xpath selection of all matching div's
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')

Result:

id
--
post_message_somenumber
post_message_somenumber2

Or, for just the first match:

# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')

Result:

id
--
post_message_somenumber
answered Oct 11 '22 04:10 by Vimes


I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success I'm still learning this, anyone have suggestions

If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.

Specifying names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or on the internet, you'll find many sources with the correct solution.

Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".

Here is a good answer on how to do this in C#.

Read the documentation of your language or XPath engine to see how something similar should be done in your specific environment.
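
As an illustration of the prefix binding described above, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. The sample page, the prefix "x" and the variable names are assumptions made up for this example, and the page is assumed to declare the usual XHTML namespace. ElementTree supports only a subset of XPath with no starts-with(), so the id test is done in Python instead; a full XPath engine could keep it inside the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: the xmlns attribute puts every element
# in the XHTML namespace (the "default namespace" case).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="post_message_1">first post</div>
    <div id="not_post_message">sidebar</div>
    <div id="post_message_2">second post</div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# An unprefixed name does not match elements in the default namespace.
unprefixed = root.findall(".//div")

# Bind the prefix "x" to the namespace URI and use it in the path.
ns = {"x": "http://www.w3.org/1999/xhtml"}
prefixed = root.findall(".//x:div", ns)

# ElementTree has no starts-with(), so filter on the id in Python
# (a swapped-in technique, not the XPath function itself).
posts = [d for d in prefixed if d.get("id", "").startswith("post_message_")]
first_post = posts[0] if posts else None

print(len(unprefixed), len(prefixed))
print(first_post.get("id") if first_post is not None else None)
```

The unprefixed query finds nothing while the prefixed one finds all three divs, which is the symptom to look for when an otherwise correct expression returns no results.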

answered Oct 11 '22 05:10 by Dimitre Novatchev