
How to search for content in XPath in multiline text using Python?

Tags: python, xpath, lxml

When I use contains() to test the text() of an element, it works for plain data but not when the element's content includes carriage returns, newlines, or child tags. How can I make //td[contains(text(), "")] work in this case? Thank you!

XML :

<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>

Python :

>>> import lxml.html as lh  # assumed import; lh is not defined in the original snippet
>>> tdout=open('tdmultiplelines.htm', 'r')
>>> tdouthtml=lh.parse(tdout)
>>> tdout.close()
>>> tdouthtml
<lxml.etree._ElementTree object at 0x2aaae0024368>
>>> tdouthtml.xpath('//td/text()')
['\n      Hello world ', '\n      Have a wonderful day.\n      Good bye!\n    ', '\n      Hello NJ ', '\n    ']
>>> tdouthtml.xpath('//td[contains(text(),"Good bye")]')
[]  # Empty, although "Good bye" appears in one of the td text nodes listed above.
>>> tdouthtml.xpath('//td[text() = "\n      Hello world "]')
[<Element td at 0x2aaae005c410>]
ThinkCode asked Jun 19 '12 18:06



1 Answer

Use:

//td[text()[contains(.,'Good bye')]]

Explanation:

The problem is not that a text node's string value spans multiple lines. The real reason is that the td element has more than one text-node child.

In the provided expression:

//td[contains(text(),"Good bye")]

the first argument passed to the function contains() is a node-set containing more than one text node.

As per the XPath 1.0 specification (in XPath 2.0 this simply raises a type error), when a function that expects a string argument is passed a node-set instead, only the string value of the first node in the node-set (in document order) is used.
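The first-node rule can be observed directly from Python. The following is a minimal sketch, assuming lxml is installed; the inline XML string is a hypothetical copy of the first row from the question, and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical inline copy of the first row of the question's XML.
XML = (
    "<table><tr><td>\n"
    "      Hello world <i> how are you? </i>\n"
    "      Have a wonderful day.\n"
    "      Good bye!\n"
    "    </td></tr></table>"
)
doc = etree.fromstring(XML)

# All text-node children of the td, in document order.
text_nodes = doc.xpath("//td/text()")

# string() applied to a node-set uses only the first node in document order,
# which is the same conversion contains() applies to its first argument.
first_only = doc.xpath("string(//td/text())")

print(text_nodes)
print(repr(first_only))
```

The second text node, which holds "Good bye!", never takes part in the conversion.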

In this specific case, the first text node of the passed node-set has string value:

 "
                 Hello world "

This string does not contain "Good bye", so the test fails and the wanted td element isn't selected.
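To compare the two expressions side by side in Python, something like the sketch below should do. It assumes lxml is installed and parses the question's XML from an inline string rather than from the tdmultiplelines.htm file; the names failing and working are illustrative.

```python
from lxml import etree

# The XML from the question, inlined so the example is self-contained.
XML = """<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>"""
doc = etree.fromstring(XML)

# contains() only sees the first text node of each td, so nothing matches.
failing = doc.xpath('//td[contains(text(), "Good bye")]')

# Here each text node is tested on its own, so the td whose second
# text node contains "Good bye" is selected.
working = doc.xpath('//td[text()[contains(., "Good bye")]]')

print(len(failing), len(working))
```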

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select="//td[text()[contains(.,'Good bye')]]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<table>
      <tr>
        <td>
          Hello world <i> how are you? </i>
          Have a wonderful day.
          Good bye!
        </td>
      </tr>
      <tr>
        <td>
          Hello NJ <i>, how are you?
          Have a wonderful day.</i>
        </td>
      </tr>
</table>

the XPath expression is evaluated and the selected nodes (in this case just one) are copied to the output:

<td>
          Hello world <i> how are you? </i>
          Have a wonderful day.
          Good bye!
        </td>
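The same check can be made from Python without XSLT. This is a rough sketch, assuming lxml is installed; tds_containing is a hypothetical helper name, and the search string is passed as an XPath variable ($needle) so it does not have to be quoted into the expression by hand.

```python
from lxml import etree


def tds_containing(root, needle):
    """Return td elements with at least one text-node child containing needle."""
    # $needle is an XPath variable; lxml binds it from the keyword argument.
    return root.xpath("//td[text()[contains(., $needle)]]", needle=needle)


# The XML from the question, inlined so the example is self-contained.
XML = """<table>
  <tr>
    <td>
      Hello world <i> how are you? </i>
      Have a wonderful day.
      Good bye!
    </td>
  </tr>
  <tr>
    <td>
      Hello NJ <i>, how are you?
      Have a wonderful day.</i>
    </td>
  </tr>
</table>"""
root = etree.fromstring(XML)

for td in tds_containing(root, "Good bye"):
    # with_tail=False leaves out the whitespace that follows the closing tag.
    print(etree.tostring(td, encoding="unicode", with_tail=False))
```

Note that this expression looks only at text nodes that are direct children of td, so text inside the nested i element (for example "how are you") is not matched.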
Dimitre Novatchev answered Oct 02 '22 23:10
