I have hundreds of HTML files that need to be converted to XML. We have been using these HTML files to serve content to applications, but now we have to serve that content as XML.
The HTML files contain tables, divs, images, p's, b or strong tags, etc.
I googled and found some applications, but I haven't managed to get it working yet.
Could you suggest a way to convert the contents of these files to XML?
HTML and XML are related to each other: HTML describes the structure of a web page and how its data is displayed, whereas XML stores and transfers data. HTML is a simple language with a predefined set of tags, while XML is a standard for defining other markup languages.
You can include HTML content inside an XML document. One possibility is encoding it in Base64, as you have mentioned. Another is wrapping it in a CDATA section. Just remember that XML and CDATA sections preserve whitespace.
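As a rough illustration of the CDATA approach, here is a minimal Python sketch. The function name wrap_html_as_cdata and the element name content are made up for this example, not part of any library. The one thing to watch for is that a CDATA section cannot contain the sequence ]]>, so the sketch splits it across two sections.

```python
import xml.etree.ElementTree as ET


def wrap_html_as_cdata(html: str, tag: str = "content") -> str:
    """Wrap raw HTML in a CDATA section inside a single XML element.

    `tag` is a hypothetical element name chosen for this example.
    """
    # "]]>" would end the CDATA section early, so split it across
    # two adjacent CDATA sections.
    safe = html.replace("]]>", "]]]]><![CDATA[>")
    return f"<{tag}><![CDATA[{safe}]]></{tag}>"


html = '<div class="note"><p>Hello <b>world</b> &nbsp; ]]> done</p></div>'
xml_doc = wrap_html_as_cdata(html)

# Parsing the result gives back the original HTML string unchanged.
root = ET.fromstring(xml_doc)
recovered = root.text
```

Note that this only transports the HTML as opaque text. The consumer still receives HTML, not structured XML, so it does not help if the markup itself has to be queried or transformed.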
I was successful using the tidy command-line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:
tidy -q -asxml --numeric-entities yes source.html >file.xml
gave an XML file, which I was able to process with an XSLT processor. However, I needed to set up the XHTML 1 DTDs correctly.
This is their homepage: html-tidy.org (and the legacy one: HTML Tidy)
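Since the question is about hundreds of files, the same tidy command can be run in a loop. Below is a minimal Python sketch, assuming tidy is installed and on the PATH. The helper names tidy_command and convert_directory, and the choice of writing one .xml file per .html file, are assumptions made for this example.

```python
import subprocess
from pathlib import Path


def tidy_command(source):
    """Build the same tidy invocation as the command shown above."""
    return ["tidy", "-q", "-asxml", "--numeric-entities", "yes", str(source)]


def convert_directory(src_dir, out_dir):
    """Run tidy on every *.html file in src_dir, writing *.xml into out_dir.

    Returns the names of files that tidy could not convert.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for html_file in sorted(Path(src_dir).glob("*.html")):
        # tidy writes the converted document to stdout, like the
        # shell redirect in the original command.
        result = subprocess.run(tidy_command(html_file), capture_output=True)
        # tidy exits with 0 on success, 1 on warnings, 2 on errors.
        if result.returncode > 1:
            failed.append(html_file.name)
            continue
        (out / html_file.with_suffix(".xml").name).write_bytes(result.stdout)
    return failed
```

Calling convert_directory("html", "xml") would then convert everything in a folder named html and return the list of files that need a manual look. Files that only produce warnings are still written, since tidy usually repairs those on its own.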