
XPath predicate with sub-paths with lxml?

I'm trying to understand an XPath expression that was sent to me for use with ACORD XML forms (a common format in insurance). The XPath they sent me is (truncated for brevity):

./PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo

Where I'm running into trouble is that Python's lxml library tells me that [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"] is an invalid predicate. I can't find anything in the predicates section of the XPath spec that describes this syntax, so I don't know how to modify the predicate to make it work.

Is there any documentation on what exactly this predicate is selecting? Also, is this even a valid predicate, or has something been mangled somewhere?

Possibly related:

I believe the company I am working with is an MS shop, so this XPath may be valid in C# or some other language in that stack? I'm not entirely sure.

Updates:

As requested in the comments, here is some additional info.

XML sample:

<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
            <InsuredOrPrincipal>
                <InsuredOrPrincipalInfo>
                    <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                </InsuredOrPrincipalInfo>
                <GeneralPartyInfo>
                    <Addr>
                        <Addr1></Addr1>
                    </Addr>
                </GeneralPartyInfo>
            </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>

Code sample (with full XPath instead of snippet):

>>> from lxml import etree
>>> tree = etree.fromstring(raw)
>>> tree.find('./InsuranceSvcRq/HomePolicyQuoteInqRq/PersPolicy/PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo/Addr/Addr1')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "lxml.etree.pyx", line 1409, in lxml.etree._Element.find (src/lxml/lxml.etree.c:39972)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 271, in find
    it = iterfind(elem, path, namespaces)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 261, in iterfind
    selector = _build_path_iterator(path, namespaces)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 245, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 207, in prepare_predicate
    raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
Jack M. asked Jun 02 '11 17:06



2 Answers

Change tree.find to tree.xpath. find and findall are present in lxml to provide compatibility with other implementations of ElementTree. These methods do not implement the entire XPath language. To employ XPath expressions containing more advanced features, use the xpath method, the XPath class, or XPathEvaluator.

For example:

import io
import lxml.etree as ET

content='''\
<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
            <InsuredOrPrincipal>
                <InsuredOrPrincipalInfo>
                    <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                </InsuredOrPrincipalInfo>
                <GeneralPartyInfo>
                    <Addr>
                        <Addr1></Addr1>
                    </Addr>
                </GeneralPartyInfo>
            </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>
'''
# BytesIO needs bytes, so encode the text first (required on Python 3)
tree=ET.parse(io.BytesIO(content.encode('utf-8')))
path='//PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo'
result=tree.xpath(path)
print(result)

yields

[<Element GeneralPartyInfo at b75a8194>]

while tree.find yields

SyntaxError: invalid node predicate
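
The XPath class and XPathEvaluator mentioned above accept the same expression. The following is a minimal sketch rather than a definitive recipe: it reuses the sample document from the question, and the variable name `$role` and the Python names `expr`, `find_party` and `evaluator` are my own choices for illustration.

```python
import lxml.etree as ET

xml = b"""\
<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
          <InsuredOrPrincipal>
            <InsuredOrPrincipalInfo>
              <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
            </InsuredOrPrincipalInfo>
            <GeneralPartyInfo>
              <Addr>
                <Addr1></Addr1>
              </Addr>
            </GeneralPartyInfo>
          </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>
"""
root = ET.fromstring(xml)

# Same predicate as in the question, with the role code passed in as an
# XPath variable ($role is an illustrative name, not part of ACORD).
expr = ('//PersApplicationInfo/InsuredOrPrincipal'
        '[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd=$role]'
        '/GeneralPartyInfo')

# Option 1: compile the expression once and call it on any element or tree.
find_party = ET.XPath(expr)
compiled_result = find_party(root, role="AN")

# Option 2: bind an evaluator to one element and pass expressions to it.
evaluator = ET.XPathEvaluator(root)
evaluator_result = evaluator(expr, role="AN")

print([el.tag for el in compiled_result])
print([el.tag for el in evaluator_result])
```

A compiled XPath object is worth considering when the same expression is run against many documents, since the expression is only parsed once.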
unutbu answered Sep 28 '22 11:09



Your example is perfectly fine in my opinion. I would check whether lxml's XPath implementation has any documented limitations.
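
As a rough illustration of what the predicate selects, here is a small sketch. It assumes a cut-down document with two InsuredOrPrincipal entries; the second role code "XX" and the address values "first" and "second" are made up for the example and are not ACORD data. The predicate keeps only those InsuredOrPrincipal elements that have an InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd descendant whose text equals "AN". The second half shows one possible way to get the same result using only find, findall and findtext, by moving the test into Python.

```python
import lxml.etree as ET

# Hypothetical sample data: role code "XX" and the Addr1 values are invented.
xml = b"""\
<PersApplicationInfo>
  <InsuredOrPrincipal>
    <InsuredOrPrincipalInfo>
      <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
    </InsuredOrPrincipalInfo>
    <GeneralPartyInfo><Addr><Addr1>first</Addr1></Addr></GeneralPartyInfo>
  </InsuredOrPrincipal>
  <InsuredOrPrincipal>
    <InsuredOrPrincipalInfo>
      <InsuredOrPrincipalRoleCd>XX</InsuredOrPrincipalRoleCd>
    </InsuredOrPrincipalInfo>
    <GeneralPartyInfo><Addr><Addr1>second</Addr1></Addr></GeneralPartyInfo>
  </InsuredOrPrincipal>
</PersApplicationInfo>
"""
root = ET.fromstring(xml)

# Full XPath: the sub-path inside [...] is evaluated relative to each
# InsuredOrPrincipal, and only those with a matching role code are kept.
via_xpath = root.xpath(
    './InsuredOrPrincipal'
    '[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]'
    '/GeneralPartyInfo/Addr/Addr1'
)

# ElementPath-only alternative: plain paths in find/findall/findtext,
# with the predicate expressed as a Python condition instead.
via_find = [
    insured.find('GeneralPartyInfo/Addr/Addr1')
    for insured in root.findall('InsuredOrPrincipal')
    if insured.findtext('InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd') == 'AN'
]

print([el.text for el in via_xpath])
print([el.text for el in via_find])
```

Both lists should contain only the Addr1 element belonging to the "AN" entry.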

Achim answered Sep 28 '22 12:09
