Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python - lxml how to get children of element by tag name?

Tags:

python

lxml

I have a xml file that looks like this:

<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page> 

How can I get the text of each page? Right now I have a list of each page. The following code will print the text of the second page element but not the first. Is there a way to take the child element by tag name like element['text']

for i in pages:
    print i[3]
like image 746
johan dekker Avatar asked Mar 19 '17 20:03

johan dekker


People also ask

What is LXML in BeautifulSoup?

lxml can make use of BeautifulSoup as a parser backend, just like BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser in the html. parser module.

How do I import LXML?

Type “ pip install lxml ” (without quotes) in the command line and hit Enter again. This installs lxml for your default Python installation. The previous command may not work if you have both Python versions 2 and 3 on your computer. In this case, try "pip3 install lxml" or “ python -m pip install lxml “.


2 Answers

You can write code something like this :

from lxml import html

xml = """<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>"""

root = html.fromstring(xml)
print(root.xpath('//page/text/text()'))

The result will be :

['hello world!@', 'hello world']
like image 70
Satish Prakash Garg Avatar answered Sep 21 '22 17:09

Satish Prakash Garg


To simplify the issue I am using a "Node" helper class that will return a dict:

class Node():
    @staticmethod
    def childTexts(node):
        texts={}
        for child in list(node):
            texts[child.tag]=child.text
        return texts  

example usage:

xml = """<pages>
<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>
</pages>

"""

root = etree.fromstring(xml)
for node in root.xpath('//page'):
    texts=Node.childTexts(node)
    print (texts)

result:

{'title': 'title1', 'subtitle': 'subtitle', 'ns': '0', 'id': '1', 'text': 'hello world!@'}
{'title': 'title2', 'ns': '0', 'id': '1', 'text': 'hello world'}
like image 28
Wolfgang Fahl Avatar answered Sep 20 '22 17:09

Wolfgang Fahl