Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting text from XML using python

Tags:

python

xml

I have this example xml file

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

I like to extract the contents of title tags and content tags.

Which method is good to extract the data, using pattern matching or using xml module. Or is there any better way to extract the data.

like image 454
Sudeep Avatar asked Oct 07 '11 18:10

Sudeep


People also ask

How do I get text from XML?

How to extract text and metadata from XML files. Click inside the file drop area to upload a XML file or drag & drop a XML file. Click Get Text and Metadata button to extract text and metadata from your XML document. Once your XML is processed click on Download Now button.

How do you parse an XML string in Python?

There are two ways to parse the file using 'ElementTree' module. The first is by using the parse() function and the second is fromstring() function. The parse () function parses XML document which is supplied as a file whereas, fromstring parses XML when supplied as a string i.e within triple quotes.


2 Answers

There is already a built-in XML library, notably ElementTree. For example:

>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
like image 167
Santa Avatar answered Sep 19 '22 05:09

Santa


Code :

from xml.etree import cElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
like image 22
Sashini Hettiarachchi Avatar answered Sep 18 '22 05:09

Sashini Hettiarachchi