I have to read some fairly heavy XML files (between 200 MB and 1 GB), some of which are invalid. Let me give you a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, there is a missing </ul> closing tag in the g:material tag. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not... Basically, that's what I want to do: add this missing CDATA section.
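For clarity, here is what the repaired element would presumably look like, with the </ul> restored and the inner markup wrapped in CDATA:

<g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li></ul>]]></g:material>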
I've tried to use a SAX parser to read this file, but it fails when reading the </g:material> tag since the </ul> tag is missing. I've tried with XMLReader but got basically the same issue.
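For illustration, a minimal XMLReader loop along these lines (the file name is just a placeholder) dies as soon as it reaches the mismatched tag:

<?php
// Minimal sketch of the streaming attempt; 'my.xml' is a placeholder name.
$reader = new XMLReader();
$reader->open('my.xml');

// Collect libxml errors instead of letting them surface as warnings.
libxml_use_internal_errors(true);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->name, "\n";
    }
}

// read() returns false at the malformed element, with an error along the
// lines of "Opening and ending tag mismatch: ul and g:material".
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}

$reader->close();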
I could probably do something with DOMDocument::loadHTML, but the size of these files is not really compatible with a DOM approach.
Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DOMDocument to work?
Thanks.
If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.
$ tidy -output my.clean.xml my.xml
After that, the XML files are well-formed, so you can parse them using XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's content ends up inside the <body> element.
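As a rough sketch of the whole pipeline (assuming the tidy binary is installed and on the PATH, and using placeholder file names), you could run tidy from PHP and then stream the cleaned file:

<?php
// Sketch: clean the feed with the tidy CLI, then stream-parse the result.
// 'my.xml' and 'my.clean.xml' are placeholder names.
$in  = 'my.xml';
$out = 'my.clean.xml';

// tidy exits non-zero even when it only issues warnings,
// so check for the output file rather than the exit code.
shell_exec(sprintf('tidy -output %s %s', escapeshellarg($out), escapeshellarg($in)));

if (!is_file($out)) {
    exit("tidy did not produce an output file\n");
}

$reader = new XMLReader();
$reader->open($out);

while ($reader->read()) {
    // The original feed content now sits inside the <body> element
    // that tidy wrapped around it; XMLReader::name is the qualified name.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'g:material') {
        // readOuterXml() returns the whole (now well-formed) element as a
        // string, so it can be re-emitted wrapped in a CDATA section.
        echo $reader->readOuterXml(), "\n";
    }
}

$reader->close();

Since XMLReader only keeps the current node in memory, this stays flat on RAM even for files in the 1 GB range.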