
xpath expression for regex-like matching?

Tags: regex, ruby, xpath

I want to find div elements in an HTML document whose id matches a certain pattern. As a regex, the pattern is:

foo_([[:digit:]]{1,8})

What is the XPath equivalent of this pattern?

I'm stuck at //div[@id="foo_ and don't know how to continue. Could someone complete this into a valid expression?

EDIT

Sorry, I should elaborate. The prefix is actually post_message_, not foo_.

By the way, I'm using Mechanize/Nokogiri (Ruby).

Here's the snippet :

html_doc = Nokogiri::HTML(open(myfile))
message_div =  html_doc.xpath('//div[substring(@id,13) = "post_message_" and substring-after(@id, "post_message_") => 0 and substring-after(@id, "post_message_") <= 99999999]') 

It still fails, with this error message:

Couldn't evaluate expression '//div[substring(@id,13) = "post_message_" and substring-after(@id, "post_message_") => 0 and substring-after(@id, "post_message_") <= 99999999]' (Nokogiri::XML::XPath::SyntaxError)

mhd asked Feb 28 '09 12:02


2 Answers

How about this (updated):

XPath 1.0:

"//div[substring-before(@id, '_') = 'foo' 
       and substring-after(@id, '_') >= 0 
       and substring-after(@id, '_') <= 99999999]"

Edit #2: The OP made a change to the question. The following, even more reduced XPath 1.0 expression works for me:

"//div[substring(@id, 1, 13) = 'post_message_' 
       and substring(@id, 14) >= 0 
       and substring(@id, 14) <= 99999999]"

XPath 2.0 has a convenient matches() function:

"//div[matches(@id, '^foo_\d{1,8}$')]"

Apart from the better portability, I would expect the numerical expression (XPath 1.0 style) to perform better than the regex test, though this would only become noticeable when processing large data sets.
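
For Nokogiri users: it evaluates XPath through libxml2, which implements XPath 1.0 only, so matches() is not available there. One workaround is to do the regex step in Ruby instead. The following is a minimal sketch of that step on plain strings; the constant FOO_ID and the sample ids are made up for illustration.

```ruby
# Ruby-side counterpart of matches(@id, '^foo_\d{1,8}$').
# FOO_ID and sample_ids are illustrative names, not from the original post.
FOO_ID = /\Afoo_\d{1,8}\z/

sample_ids = ["foo_1", "foo_12345678", "foo_123456789", "foo_", "foo_12a", "bar_12"]

# Keep only the ids made of "foo_" followed by one to eight digits.
matching = sample_ids.grep(FOO_ID)
puts matching.inspect
```

With Nokogiri, the same filter could presumably be applied to nodes pre-selected with an XPath 1.0 test, for example html_doc.xpath('//div[starts-with(@id, "foo_")]').select { |div| div['id'] =~ FOO_ID }.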


Original version of the answer:

"//div[substring-before(@id, '_') = 'foo' 
       and number(substring-after(@id, '_')) = substring-after(@id, '_') 
       and number(substring-after(@id, '_')) &gt;= 0 
       and number(substring-after(@id, '_')) &lt;= 99999999]"

The use of the number() function is unnecessary, because the comparison operators implicitly coerce their arguments to numbers. Any non-number becomes NaN, and the greater-than/less-than tests then fail.

I also removed the encoding of the angle brackets, since this is an XML requirement, not an XPath requirement.
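
The implicit coercion can be illustrated outside XPath. Below is a rough Ruby emulation of the post_message_ predicate, assuming a simplified version of the XPath 1.0 string-to-number rules. The helper names xpath_number and post_message_id? are hypothetical and exist only for this sketch.

```ruby
# Rough emulation of XPath 1.0 string-to-number conversion. The helper names
# are hypothetical and the grammar is simplified: optional whitespace, an
# optional minus sign, then decimal digits with an optional fraction.
def xpath_number(str)
  if str.match?(/\A\s*-?(\d+(\.\d*)?|\.\d+)\s*\z/)
    str.strip.to_f
  else
    Float::NAN
  end
end

# Mirrors the predicate:
#   substring(@id, 1, 13) = 'post_message_'
#   and substring(@id, 14) >= 0 and substring(@id, 14) <= 99999999
def post_message_id?(id)
  return false unless id[0, 13] == "post_message_"

  n = xpath_number(id[13..-1])
  n >= 0 && n <= 99_999_999 # both comparisons are false when n is NaN
end

puts post_message_id?("post_message_42")        # true
puts post_message_id?("post_message_abc")       # false: "abc" becomes NaN
puts post_message_id?("post_message_")          # false: "" becomes NaN
puts post_message_id?("post_message_123456789") # false: above the upper bound
```

One caveat, as far as the XPath 1.0 number rules go: a suffix such as 1.5 is a valid number inside the range, so the numeric test is slightly looser than the regex.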

Tomalak answered Oct 23 '22 00:10


As already pointed out, XPath 2.0 offers standard regex support through functions such as matches(), which is the better choice when it is available.

One possible XPath 1.0 solution:

//div[starts-with(@id, 'post_message_')
    and
      string-length(@id) > 13
    and
      string-length(@id) <= 21
    and
      translate(substring-after(@id, 'post_message_'), 
                '0123456789', 
                ''
                )
     =
      ''
      ] 

Do note the following:

  1. The use of the standard XPath function starts-with().

  2. The use of the standard XPath function string-length().

  3. The use of the standard XPath function substring-after().

  4. The use of the standard XPath function translate().
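
The translate() call deletes every digit from the part of the id after the prefix, so an empty result means that part contained nothing but digits. Ruby's String#delete does the same job, which makes the idea easy to try out. This is only a sketch: the helper name digits_only_suffix? is hypothetical, and the string-length() check is left out on purpose to keep the focus on translate().

```ruby
PREFIX = "post_message_"

# Emulates:
#   starts-with(@id, 'post_message_')
#   and translate(substring-after(@id, 'post_message_'), '0123456789', '') = ''
# Hypothetical helper; the string-length() test is intentionally omitted.
def digits_only_suffix?(id)
  return false unless id.start_with?(PREFIX)

  suffix = id[PREFIX.length..-1]
  # Removing all digits leaves an empty string only if nothing else was there.
  suffix.delete("0123456789").empty?
end

puts digits_only_suffix?("post_message_12345678") # true
puts digits_only_suffix?("post_message_12a45678") # false: "a" is left over
puts digits_only_suffix?("post_message_")         # true: the suffix is empty
```

The last call shows why the length check in the XPath expression matters: on its own, the translate() comparison also accepts an id with no digits at all.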

Dimitre Novatchev answered Oct 23 '22 01:10