I'm trying to use Python to extract multiple XML elements from a mixed-content document. The use case is an email that contains email text but also contains multiple XML trees.
Here's the example document:
Email text email text email text email text.
email signature email signature.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
Email text email text email text email text.
email signature email signature.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
Email text email text email text email text.
email signature email signature.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
Email text email text email text email text.
email signature email signature.
I want to extract the XML trees so they can be parsed by an XML parser in a for loop. Parsing the XML itself already works: if I take one of the XML trees and parse it directly, it works like a charm.
Any advice on how to extract the XML trees? This example is oversimplified: the email text and signatures are different in each real message, so the only reliable text to key on is the beginning and end of each XML tree.
Question: I want to extract the XML trees so they can be parsed by an XML parser
Do you really want multiple XML trees?
I would suggest building one XML tree with multiple <book> subelements instead.
Nevertheless, here is what you asked for:
xml_tag = "<?xml"
catalog_end_tag = "</catalog>"
xml_tree = []
_xml = False  # True while we are inside an XML document
with open('test/Mixed_email_xml') as fh:
    for line in fh:
        # An XML declaration starts a new document
        if line.find(xml_tag) >= 0:
            _xml = True
        if _xml:
            xml_tree.append(line)
        # The closing root tag ends the current document
        if line.find(catalog_end_tag) >= 0:
            _xml = False
for line in xml_tree:
    print('{}'.format(line[:-1]))
Tested with Python: 3.4.2
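Note that the loop above collects the lines of every XML document into one flat list. If each tree needs to go to the parser separately, the same start/end flag can drive a generator instead. The following is only a sketch under assumptions: the helper name `iter_xml_documents`, its parameters, and the shortened sample text are made up for illustration, and it uses the standard-library `xml.etree.ElementTree` rather than whatever parser the question uses.

```python
import io
import xml.etree.ElementTree as ET


def iter_xml_documents(lines, start_marker="<?xml", end_marker="</catalog>"):
    """Yield each XML document found in an iterable of text lines.

    Hypothetical helper: the markers are assumptions taken from the example.
    """
    buffer = []
    inside = False
    for line in lines:
        if not inside:
            start = line.find(start_marker)
            if start < 0:
                continue  # plain email text, skip it
            inside = True
            line = line[start:]  # drop any text before the declaration
        end = line.find(end_marker)
        if end >= 0:
            # Keep everything up to and including the closing root tag
            buffer.append(line[:end + len(end_marker)])
            yield "".join(buffer)
            buffer = []
            inside = False
        else:
            buffer.append(line)


# Shortened stand-in for the email from the question
sample = (
    "Email text.\n"
    '<?xml version="1.0"?>\n'
    '<catalog><book id="bk101"><title>First</title></book></catalog>\n'
    "Signature.\n"
    '<?xml version="1.0"?>\n'
    '<catalog><book id="bk102"><title>Second</title></book></catalog>\n'
    "More email text.\n"
)

documents = list(iter_xml_documents(io.StringIO(sample)))
titles = [ET.fromstring(doc).findtext("book/title") for doc in documents]
print(titles)
```

An open file object can be passed in place of the `io.StringIO` wrapper, since both iterate line by line.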
The simplest way:
import re
from lxml import etree
with open('email.txt') as f:
    catalogs = ''.join(re.findall('<catalog.*?</catalog>', f.read(), re.S))
root = etree.fromstring('<?xml version="1.0"?><root>{}</root>'.format(catalogs))
Then you could just use root.iter('book') to iterate over all the book nodes.
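If the trees should stay separate, as the question asks, the same non-greedy regex can feed each match to the parser on its own. This is a rough sketch, not a drop-in replacement: it swaps lxml for the standard-library `xml.etree.ElementTree`, the name `parse_catalogs` and the sample text are invented for illustration, and it assumes `<catalog>` elements are never nested and that `</catalog>` never appears inside a comment or CDATA section.

```python
import re
import xml.etree.ElementTree as ET

# Non-greedy match from an opening <catalog ...> to the nearest </catalog>
CATALOG_RE = re.compile(r"<catalog\b.*?</catalog>", re.DOTALL)


def parse_catalogs(text):
    """Return one parsed Element per <catalog> block (hypothetical helper)."""
    return [ET.fromstring(match.group(0)) for match in CATALOG_RE.finditer(text)]


# Shortened stand-in for the email from the question
sample = (
    "Email text.\n"
    '<?xml version="1.0"?>\n'
    '<catalog>\n<book id="bk101"><title>First</title></book>\n</catalog>\n'
    "Signature.\n"
    '<?xml version="1.0"?>\n'
    '<catalog>\n<book id="bk102"><title>Second</title></book>\n</catalog>\n'
)

ids = []
for catalog in parse_catalogs(sample):
    for book in catalog.iter("book"):
        ids.append(book.get("id"))
        print(book.get("id"), book.findtext("title"))
```

Skipping the `<?xml ...?>` declaration is harmless here because the parser does not require it for a string that is already decoded.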
With the help from another very smart developer, this code solves my problem.
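# data is assumed to hold the full email text as a str
str1 = "<?xml"
str2 = "</catalog>"
i = 0
ii = 0
tracker = []   # start index of every "<?xml"
final_ls = []  # extracted XML documents

# Pass 1: record where each XML declaration starts
for c in data:
    for char in str1:
        # Bounds check avoids an IndexError near the end of data
        if i + ii < len(data) and data[i + ii] == char:
            if ii == len(str1) - 1:
                tracker.append(i)
            ii += 1
    i += 1
    ii = 0

# Pass 2: from each start, scan forward to the first "</catalog>"
for xml in tracker:
    ii = 0
    i = xml
    for c in data[i:]:
        if ii == len(str2):
            break  # closing tag already found for this document
        ii = 0
        for char in str2:
            if i + ii < len(data) and data[i + ii] == char:
                if ii == len(str2) - 1:
                    # + 1 so the slice includes the final ">" of the closing tag
                    final_ls.append(data[xml:i + ii + 1])
                ii += 1
        i += 1

for ls in final_ls:
    print(ls)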
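The character-by-character scan above is essentially a hand-written substring search, so the same result can probably be reached more directly with `str.find`. A minimal sketch, assuming the whole email is available as one string; the function name `extract_between` and the sample text are made up for illustration.

```python
def extract_between(data, start_marker="<?xml", end_marker="</catalog>"):
    """Return every substring that runs from start_marker through end_marker."""
    blocks = []
    pos = 0
    while True:
        start = data.find(start_marker, pos)
        if start == -1:
            break  # no further XML declaration
        end = data.find(end_marker, start)
        if end == -1:
            break  # declaration without a closing root tag, stop here
        end += len(end_marker)  # include the closing tag itself
        blocks.append(data[start:end])
        pos = end  # continue searching after this document
    return blocks


# Shortened stand-in for the email from the question
sample = (
    "Email text.\n"
    '<?xml version="1.0"?>\n<catalog><book id="bk101"/></catalog>\n'
    "Signature.\n"
    '<?xml version="1.0"?>\n<catalog><book id="bk102"/></catalog>\n'
    "More email text.\n"
)

blocks = extract_between(sample)
for block in blocks:
    print(block)
```

Each entry in `blocks` starts at the XML declaration and ends with the closing root tag, so it can be handed to the parser inside a for loop.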
My first idea is to use the str methods to split the whole text, like this:
t = txt.split(r'<?xml version="1.0"?>')
results = [item.split("</catalog>")[0] + "</catalog>" for item in t if item.startswith("\n<catalog>")]
for i in results:
    print(i)
As the code shows, it simply splits on the obvious delimiters: the XML declaration and the closing </catalog> tag.