python - lxml how to get children of element by tag name?

Tags: python, lxml

I have an XML file that looks like this:

<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>

How can I get the text of each page? Right now I have a list of the page elements. The following code prints the text of the second page element but not the first. Is there a way to get a child element by tag name, something like element['text']?

for i in pages:
    print(i[3].text)  # index 3 is <text> only when <subtitle> is absent

asked Mar 19 '17 20:03

johan dekker

2 Answers

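You can select the text with an XPath expression, for example (the snippet has no single root element, so it is parsed with lxml's lenient HTML parser):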

from lxml import html

xml = """<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>"""

root = html.fromstring(xml)
print(root.xpath('//page/text/text()'))

The result will be:

['hello world!@', 'hello world']
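
If the goal is to look up a child by tag name one page at a time, elements also provide find() and findtext(). The following is a minimal sketch, not part of the original answer. It assumes the pages are wrapped in a hypothetical <pages> root so that a strict XML parser accepts the input, and it falls back to the standard library's ElementTree, which offers the same calls, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library exposes the same find/findall/findtext calls.
    import xml.etree.ElementTree as etree

# Assumption: the <page> elements are wrapped in a single <pages> root,
# because an XML document must have exactly one root element.
xml = """<pages>
<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>
</pages>"""

root = etree.fromstring(xml)

# findtext() returns the text of the first direct child with the given tag,
# so the result does not depend on the child's position.
page_texts = [page.findtext("text") for page in root.findall("page")]
print(page_texts)  # ['hello world!@', 'hello world']
```

Unlike i[3], this lookup is unaffected by the optional <subtitle> element. findtext() returns None when the tag is missing.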

answered Sep 21 '22 17:09

Satish Prakash Garg

To simplify this, I use a small "Node" helper class whose static method returns a dict mapping each child's tag to its text:

class Node:
    @staticmethod
    def childTexts(node):
        """Return a dict mapping each direct child's tag to its text."""
        texts = {}
        for child in node:
            texts[child.tag] = child.text
        return texts

Example usage:

from lxml import etree

xml = """<pages>
<page>
    <title>title1</title>
    <subtitle>subtitle</subtitle>
    <ns>0</ns>
    <id>1</id>
    <text>hello world!@</text>
</page>
<page>
    <title>title2</title>
    <ns>0</ns>
    <id>1</id>
    <text>hello world</text>
</page>
</pages>

"""

root = etree.fromstring(xml)
for node in root.xpath('//page'):
    texts = Node.childTexts(node)
    print(texts)

Result:

{'title': 'title1', 'subtitle': 'subtitle', 'ns': '0', 'id': '1', 'text': 'hello world!@'}
{'title': 'title2', 'ns': '0', 'id': '1', 'text': 'hello world'}
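
Since not every page has every child (the second page has no subtitle), indexing the dict with a missing tag raises KeyError. The sketch below shows one way to handle optional children with dict.get(). It is an illustration only: child_texts is a hypothetical function equivalent of the helper above, the XML is shortened, and the import falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library's ElementTree supports the calls used here.
    import xml.etree.ElementTree as etree


def child_texts(node):
    """Hypothetical helper: map each direct child's tag to its text."""
    return {child.tag: child.text for child in node}


xml = """<pages>
<page><title>title1</title><subtitle>subtitle</subtitle><text>hello world!@</text></page>
<page><title>title2</title><text>hello world</text></page>
</pages>"""

root = etree.fromstring(xml)

# dict.get() supplies a default for pages that lack the optional child.
subtitles = [child_texts(page).get("subtitle", "") for page in root.iter("page")]
print(subtitles)  # ['subtitle', '']
```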

answered Sep 20 '22 17:09

Wolfgang Fahl
