Trouble using XPath "starts with" to parse XHTML

Tags:

xpath

yql

I'm trying to parse a webpage to get posts from a forum.
Each message starts with a tag in the following format:

<div id="post_message_somenumber">

and I only want to get the first one.

I tried xpath='//div[starts-with(@id, '"post_message_')]' in YQL without success.
I'm still learning this. Does anyone have suggestions?


asked Feb 01 '11 03:02

bigbucky

2 Answers

I think I have a solution that does not require dealing with namespaces.

Here is an expression that selects all matching div elements:

//div[@id[starts-with(.,"post_message")]]

But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:

(//div[@id[starts-with(.,"post_message")]])[1]

Both expressions use the dot to refer to the id attribute's value inside the starts-with() function. Depending on your language, you may have to escape the quote characters.

It works great for me in PowerShell:

# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'

# Run the XPath selection of all matching div elements
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')

Result:

id
--
post_message_somenumber
post_message_somenumber2

Or, for just the first match:

# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')

Result:

id
--
post_message_somenumber
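
For readers who are not on PowerShell, here is a rough Python counterpart to the example above, using only the standard library. It is a sketch under one important assumption: the XPath subset in xml.etree.ElementTree has no starts-with() function, so the prefix test is done in plain Python rather than with the XPath expressions shown above. The helper name matching_divs is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the PowerShell example.
SAMPLE = (
    '<root>'
    '<div id="post_message_somenumber"/>'
    '<div id="not_post_message"/>'
    '<div id="post_message_somenumber2"/>'
    '</root>'
)


def matching_divs(root, prefix="post_message"):
    """Return every div whose id starts with prefix, in document order.

    Hypothetical helper: ElementTree's XPath subset lacks starts-with(),
    so the filtering happens in Python instead of in the path expression.
    """
    return [d for d in root.iter("div") if d.get("id", "").startswith(prefix)]


root = ET.fromstring(SAMPLE)

# All matches, the counterpart of //div[@id[starts-with(.,"post_message")]]
all_ids = [d.get("id") for d in matching_divs(root)]

# First match in the whole document, the counterpart of (...)[1]
first_id = all_ids[0] if all_ids else None

print(all_ids)
print(first_id)
```

With an engine that implements full XPath 1.0, the two expressions above can be passed through unchanged instead.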


answered Oct 11 '22 04:10

Vimes

I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success I'm still learning this, anyone have suggestions

If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.

Specifying the names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or on the internet, you'll find many sources with the correct solution.

Generally, a special method must be called that binds a prefix (say "x") to the URI of the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".

Here is a good answer on how to do this in C#.

Consult the documentation of your language or XPath engine to see how the same thing is done in your specific environment.
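
As one illustration of the prefix binding described above, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The page fragment is hypothetical, since the real forum markup was not shown in the question, and the prefix "x" is an arbitrary choice. ElementTree's XPath subset has no starts-with(), so the id test is done in Python here. An engine with full XPath 1.0 support could use //x:div[starts-with(@id, "post_message_")] once the prefix is bound.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"  # standard XHTML namespace URI

# Hypothetical fragment: an XHTML page that declares a default namespace.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="header"/>'
    '<div id="post_message_101">first</div>'
    '<div id="post_message_102">second</div>'
    '</body></html>'
)

root = ET.fromstring(PAGE)

# An unprefixed "div" names an element in no namespace, so nothing matches.
unprefixed = root.findall(".//div")

# Bind the prefix "x" to the default namespace URI and prefix every element name.
ns = {"x": XHTML_NS}
prefixed = root.findall(".//x:div", ns)

# Keep only the post containers, then take the first one in document order.
posts = [d for d in prefixed if d.get("id", "").startswith("post_message_")]
first_post = posts[0] if posts else None

print(len(unprefixed), len(prefixed))
print(first_post.get("id") if first_post is not None else None)
```

The empty result for the unprefixed path is the symptom this answer describes: the expression is well formed, but it names elements in the wrong namespace.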

answered Oct 11 '22 05:10

Dimitre Novatchev
