
Parsing XML within HTML using Python

I have an HTML file that contains XML at the bottom, enclosed in a comment. It looks like this:

<!DOCTYPE html>
<html>
<head>
    ***
</head>
<body>
    <div class="panel panel-primary call__report-modal-panel">
        <div class="panel-heading text-center custom-panel-heading">
            <h2>Report</h2>
        </div>
        <div class="panel-body">
            <div class="panel panel-default">
                <div class="panel-heading">
                    <div class="panel-title">Info</div>
                </div>
                <div class="panel-body">
                    <table class="table table-bordered table-page-break-auto table-layout-fixed">
                        <tr>
                            <td class="col-sm-4">ID</td>
                            <td class="col-sm-8">1</td>
                        </tr>

            </table>
        </div>
    </div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>

</ROOTTAG>
-->

The requirement is to parse the XML inside the comment in the HTML above. So far I have read the HTML file into a string and done the following:

with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"

This returns the XML as a string, d3, wrapped in three single quotes.

I then try to parse it with ElementTree:

ET.fromstring(d3)

but it fails with the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2

I need some help to:

  • Read the HTML file
  • Extract the XML snippet that is commented out at the bottom of the HTML
  • Pass that string to the ET.fromstring() function. I assumed this function takes a triple-quoted string, but my string is apparently not formatted properly, hence the error.

ShChawla asked Oct 15 '25 08:10


1 Answer

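You were already on the right path. I put your HTML in a file and the following works fine. It opens the file in text mode and passes the extracted slice straight to ET.fromstring(), without wrapping it in extra quotes.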

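import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()

# Slice out the text between the comment markers.
start = content.index('<!--') + 4
xml_text = content[start:content.index('-->', start)]
document = ET.fromstring(xml_text)

for element in document.findall("./mytag"):
    for child in element:
        # child.tag is the element name, child.text its content
        print(child.tag, child.text)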
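
As a hedged follow-up, here is a minimal sketch of the same extraction wrapped in a function. The names extract_commented_xml and read_fields are hypothetical helpers, not library functions, and the sketch assumes the XML sits in the last comment of the file and contains no comments of its own. It also illustrates why the attempt in the question likely breaks: str() applied to bytes returns the repr of the bytes object, so line breaks arrive as literal backslash sequences, and the added ''' characters become part of the data instead of acting as quotes. Triple quotes are only Python source syntax; ET.fromstring() accepts any str.

```python
import xml.etree.ElementTree as ET


def extract_commented_xml(html_text):
    """Parse the XML held in the last <!-- ... --> comment of html_text.

    Hypothetical helper for illustration; assumes the XML itself
    contains no comments.
    """
    start = html_text.rindex('<!--') + len('<!--')  # last comment opener
    end = html_text.index('-->', start)             # first closer after it
    # strip() removes stray whitespace around the XML document
    return ET.fromstring(html_text[start:end].strip())


def read_fields(root):
    """Map each <fieldname> text to its <val> text, one entry per <mytag>."""
    return {tag.findtext('fieldname'): tag.findtext('val')
            for tag in root.findall('./mytag')}


# Shortened stand-in for the HTML file from the question.
sample = '''<html><body><!-- a normal comment --><p>Report</p></body></html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
  <mytag>
    <headername>BASE</headername>
    <fieldname>NAME</fieldname>
    <val><![CDATA[Testcase]]></val>
  </mytag>
  <mytag>
    <headername>BASE</headername>
    <fieldname>AGE</fieldname>
    <val><![CDATA[5]]></val>
  </mytag>
</ROOTTAG>
-->'''

fields = read_fields(extract_commented_xml(sample))
print(fields)

# str() on bytes yields the repr, not the decoded text.
raw = b'<a>\n</a>'
assert str(raw) == "b'<a>\\n</a>'"          # b'' wrapper, escaped newline
assert raw.decode('cp1252') == '<a>\n</a>'  # decode() gives the real text
```

When reading from disk, open the file in text mode, for example open('my_html.html', encoding='cp1252') if the file really is Windows-1252 as its XML declaration says, and pass the resulting string to extract_commented_xml.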

Thomas Lehmann answered Oct 16 '25 21:10