 

lxml find tags by regex

I'm trying to use lxml to get an array of tags that are formatted as

<TEXT1>TEXT</TEXT1>

<TEXT2>TEXT</TEXT2>

<TEXT3>TEXT</TEXT3>

I tried using

xml_file.findall("TEXT*")

but this searches for a literal asterisk.

I've also tried to use ETXPath, but it doesn't seem to work. Is there an API function for this kind of match? Assuming that TEXT is always followed by integers isn't the prettiest solution.

asked Dec 14 '22 18:12 by TenaciousRaptor


1 Answer

Yes, you can use regular expressions in lxml xpath, through the EXSLT regular-expressions extension functions.

Here is one example:

results = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

Of course, in the example you mention you don't really need a regular expression. You could use the starts-with() xpath function:

results = root.xpath("//*[starts-with(local-name(), 'TEXT')]")
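
A regular expression does become useful when the suffix has to be numeric, which starts-with() cannot express. A minimal sketch, assuming the names are TEXT followed by one or more digits (the sample document, the TEXTa element, and the RE_NS name are made up for illustration):

```python
from lxml import etree

# Hypothetical sample: TEXTa and x-TEXT4 should not match.
root = etree.XML(
    "<root><TEXT1>one</TEXT1><TEXT2>two</TEXT2>"
    "<TEXTa>no digits</TEXTa><x-TEXT4>prefixed</x-TEXT4></root>"
)

# EXSLT regular-expressions namespace, the same one used in the answer.
RE_NS = {"re": "http://exslt.org/regular-expressions"}

# Anchored pattern: the local name must be exactly TEXT plus digits.
numbered = root.xpath(
    "//*[re:test(local-name(), '^TEXT[0-9]+$')]", namespaces=RE_NS
)
numbered_tags = [el.tag for el in numbered]
print(numbered_tags)
```

Anchoring the pattern with ^ and $ is what excludes names such as TEXTa; without the trailing $ the test would behave much like starts-with().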

Complete program:

from lxml import etree

root = etree.XML('''
    <root>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
      <x-TEXT4>but never four</x-TEXT4>
    </root>''')

result1 = root.xpath(
    "//*[re:test(local-name(), '^TEXT.*')]",
    namespaces={'re': "http://exslt.org/regular-expressions"})

result2 = root.xpath("//*[starts-with(local-name(), 'TEXT')]")

assert(result1 == result2)

for result in result1:
    print(result.text, result.tag)
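
If XPath is not a requirement, the same filtering can be done on the Python side with the standard re module. This is a different technique from the one in the answer, sketched here under the assumption that walking the whole tree is acceptable; find_by_local_name is a hypothetical helper, not part of lxml:

```python
import re
from lxml import etree

root = etree.XML(
    "<root><TEXT1>one</TEXT1><TEXT2>two</TEXT2>"
    "<!-- a comment --><x-TEXT4>four</x-TEXT4></root>"
)

def find_by_local_name(tree, regex):
    """Return elements whose local name fully matches regex (hypothetical helper)."""
    matches = []
    # iter(etree.Element) yields elements only, so comments are skipped.
    for el in tree.iter(etree.Element):
        # QName(...).localname drops any {namespace} prefix from the tag.
        if regex.fullmatch(etree.QName(el).localname):
            matches.append(el)
    return matches

found = find_by_local_name(root, re.compile(r"TEXT\d+"))
found_tags = [el.tag for el in found]
print(found_tags)
```

This keeps the pattern in ordinary Python regex syntax, at the cost of visiting every element instead of letting the XPath engine do the selection.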

Addressing a new requirement, consider this XML:

<root>
   <tag>
      <TEXT1>one</TEXT1>
      <TEXT2>two</TEXT2>
      <TEXT3>three</TEXT3>
   </tag>
   <other_tag>
      <TEXT1>do not want to found one</TEXT1>
      <TEXT2>do not want to found two</TEXT2>
      <TEXT3>do not want to found three</TEXT3>
   </other_tag>
</root>

If one wants to find all TEXT elements that are immediate children of a <tag> element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

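Or, if one wants to find all TEXT elements that are immediate children of only the first tag element (note that tag[1] means the first tag child of its parent, not the first tag in the whole document):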

result = root.xpath("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
assert(' '.join(e.text for e in result) == 'one two three')

Or, if one wants to find only the first TEXT element of each tag element:

result = root.xpath("//tag/*[starts-with(local-name(), 'TEXT')][1]")
assert(' '.join(e.text for e in result) == 'one')
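
With a single tag element the first two queries return the same nodes, so the difference between the three is easier to see on a document with two tag siblings. A small sketch, assuming a made-up document and a hypothetical texts() helper:

```python
from lxml import etree

# Hypothetical document with two <tag> siblings.
root = etree.XML(
    "<root>"
    "<tag><TEXT1>a1</TEXT1><TEXT2>a2</TEXT2></tag>"
    "<tag><TEXT1>b1</TEXT1><TEXT2>b2</TEXT2></tag>"
    "</root>"
)

def texts(path):
    """Run an XPath query and return the text of each matched element."""
    return [e.text for e in root.xpath(path)]

# Every TEXT child of every <tag>.
all_children = texts("//tag/*[starts-with(local-name(), 'TEXT')]")
# TEXT children of the first <tag> only.
first_tag_only = texts("//tag[1]/*[starts-with(local-name(), 'TEXT')]")
# The first TEXT child of each <tag>.
first_of_each = texts("//tag/*[starts-with(local-name(), 'TEXT')][1]")

print(all_children, first_tag_only, first_of_each)
```

The position of the [1] predicate decides what is being counted: placed on tag it picks a tag element, placed after the starts-with() filter it picks one matching child per tag.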

Resources:

  • http://www.w3schools.com/xpath/
  • http://lxml.de/xpathxslt.html

answered Dec 28 '22 07:12 by Robᵩ