Extracting text from XML using Python

Tags: python, xml

I have this example XML file:

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

I would like to extract the contents of the title and content tags.

Which approach is better for extracting the data: pattern matching or an XML module? Or is there a better way?

asked Oct 07 '11 18:10

Sudeep

2 Answers

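Python's standard library already includes an XML parser, ElementTree (xml.etree.ElementTree), which is more robust than pattern matching. The sample is wrapped in a single <root> element here because well-formed XML needs exactly one top-level element. For example: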

>>> from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
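
As a small variation on the session above, findtext returns a child's text directly and accepts a default, which avoids an AttributeError when a page lacks a title or content element. This is only a minimal sketch: the function name extract_pages and the sample with a missing content element are assumptions made for illustration, not part of the original answer.

```python
from xml.etree import ElementTree as ET

# Hypothetical sample: the second page deliberately has no <content> element.
XML_SAMPLE = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
</page>
</root>
"""


def extract_pages(xml_text):
    """Return a list of (title, content) tuples, one per <page> element."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter('page'):
        # findtext returns the default instead of raising when the child is missing
        title = page.findtext('title', default='')
        content = page.findtext('content', default='')
        pages.append((title, content))
    return pages


if __name__ == "__main__":
    for title, content in extract_pages(XML_SAMPLE):
        print('title: %s; content: %s' % (title, content))
```

Returning tuples rather than printing inside the loop keeps the parsing step reusable, for example when the results need to be written to a file or a database.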

answered Sep 19 '22 05:09

Santa

Code:

from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
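
Note that the parser expects a single top-level element. The file as posted in the question has two top-level <page> elements, so parsing it unchanged raises a ParseError. One possible workaround, sketched here under the assumptions that the text fits in memory and carries no XML declaration, is to wrap it in a synthetic root before parsing. The helper name parse_fragments and the wrapper tag name are hypothetical.

```python
from xml.etree import ElementTree as ET


def parse_fragments(text, wrapper='root'):
    """Parse XML text with several top-level elements by wrapping it in one root.

    Assumes the text has no XML declaration (<?xml ...?>), which would not be
    allowed inside the wrapper element.
    """
    return ET.fromstring('<%s>%s</%s>' % (wrapper, text, wrapper))


# The sample from the question, without a root element.
fragment = """
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>
"""

root = parse_fragments(fragment)
titles = [page.findtext('title') for page in root.findall('page')]
print(titles)
```

For a file on disk, the same helper could be fed the file's contents, for example the result of reading test.xml as text, instead of calling ET.parse on the path.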

answered Sep 18 '22 05:09

Sashini Hettiarachchi
