I have an XML file that looks like this:
<page>
<title>title1</title>
<subtitle>subtitle</subtitle>
<ns>0</ns>
<id>1</id>
<text>hello world!@</text>
</page>
<page>
<title>title2</title>
<ns>0</ns>
<id>1</id>
<text>hello world</text>
</page>
How can I get the text of each page? Right now I have a list of the page elements. The following code prints the text of the second page element but not the first. Is there a way to get a child element by tag name, like element['text']?
for i in pages:
    print(i[3])
lxml can make use of BeautifulSoup as a parser backend, just like BeautifulSoup can employ lxml as a parser. When using BeautifulSoup from lxml, however, the default is to use Python's integrated HTML parser in the html.parser module.
Type "pip install lxml" (without quotes) in the command line and hit Enter. This installs lxml for your default Python installation. The command may not work if you have both Python 2 and Python 3 on your computer. In that case, try "pip3 install lxml" or "python -m pip install lxml".
You can write something like this:
from lxml import html
xml = """<page>
<title>title1</title>
<subtitle>subtitle</subtitle>
<ns>0</ns>
<id>1</id>
<text>hello world!@</text>
</page>
<page>
<title>title2</title>
<ns>0</ns>
<id>1</id>
<text>hello world</text>
</page>"""
root = html.fromstring(xml)
print(root.xpath('//page/text/text()'))
The result will be:
['hello world!@', 'hello world']
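To look a child up by tag name, as the question asks, a minimal sketch along these lines may help. It assumes the standard library's xml.etree.ElementTree rather than lxml (lxml.etree elements offer the same findtext() method), and it wraps the sample in a hypothetical <pages> root, because XML with several top-level elements is not well-formed on its own.

```python
import xml.etree.ElementTree as ET

# The sample has several top-level <page> elements, so it is wrapped in a
# hypothetical <pages> root here to make it well-formed XML.
xml = """<pages>
<page>
<title>title1</title>
<subtitle>subtitle</subtitle>
<ns>0</ns>
<id>1</id>
<text>hello world!@</text>
</page>
<page>
<title>title2</title>
<ns>0</ns>
<id>1</id>
<text>hello world</text>
</page>
</pages>"""

root = ET.fromstring(xml)

# findtext() looks up a direct child by tag name, so the position of
# <text> inside each <page> no longer matters (unlike page[3]).
page_texts = [page.findtext("text") for page in root.iter("page")]
print(page_texts)  # ['hello world!@', 'hello world']
```

Indexing with i[3] depends on how many children come before <text>, which is why it only worked for the page without a <subtitle>. Looking up by tag name avoids that.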
To simplify the issue, I am using a "Node" helper class that returns a dict of each child's tag and text:
class Node():
    @staticmethod
    def childTexts(node):
        texts = {}
        for child in list(node):
            texts[child.tag] = child.text
        return texts
example usage:
xml = """<pages>
<page>
<title>title1</title>
<subtitle>subtitle</subtitle>
<ns>0</ns>
<id>1</id>
<text>hello world!@</text>
</page>
<page>
<title>title2</title>
<ns>0</ns>
<id>1</id>
<text>hello world</text>
</page>
</pages>
"""
from lxml import etree

root = etree.fromstring(xml)
for node in root.xpath('//page'):
    texts = Node.childTexts(node)
    print(texts)
result:
{'title': 'title1', 'subtitle': 'subtitle', 'ns': '0', 'id': '1', 'text': 'hello world!@'}
{'title': 'title2', 'ns': '0', 'id': '1', 'text': 'hello world'}
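The same idea can probably be written more compactly as a plain function with a dict comprehension. This is only a sketch: child_texts is a hypothetical name, the XML is a shortened version of the sample, and it uses the standard library's xml.etree.ElementTree in place of lxml. Using dict.get() gives None for a tag that a page does not have, such as <subtitle> on the second page.

```python
import xml.etree.ElementTree as ET


def child_texts(node):
    """Map each direct child's tag to its text (the last one wins on duplicate tags)."""
    return {child.tag: child.text for child in node}


# Shortened version of the sample document, for illustration only.
root = ET.fromstring(
    "<pages>"
    "<page><title>title1</title><subtitle>subtitle</subtitle>"
    "<text>hello world!@</text></page>"
    "<page><title>title2</title><text>hello world</text></page>"
    "</pages>"
)

pages = [child_texts(page) for page in root.iter("page")]

# .get() returns None instead of raising KeyError when a tag is missing.
subtitles = [page.get("subtitle") for page in pages]
print(subtitles)  # ['subtitle', None]
```

Note that a dict keeps only one value per tag, so this approach fits pages where each child tag appears once.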