I have to read some fairly heavy XML files (between 200 MB and 1 GB), some of which are invalid. Let me give you a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, there is a missing </ul> closing tag in the g:material tag. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not... Basically, that's what I want to do: add this missing CDATA section.
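For clarity, here is what the repaired element would presumably look like, with the </ul> restored and the inner markup wrapped in CDATA:

<g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li></ul>]]></g:material>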
I've tried to use a SAX parser to read this file, but it fails when reading the </g:material> tag since the </ul> tag is missing. I've tried with XMLReader but got basically the same issue.
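For illustration, a minimal XMLReader loop along these lines (the file name is just a placeholder) dies as soon as it reaches the mismatched tag:

<?php
// Minimal sketch of the streaming attempt; 'my.xml' is a placeholder name.
$reader = new XMLReader();
$reader->open('my.xml');

// Collect libxml errors instead of letting them surface as warnings.
libxml_use_internal_errors(true);

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        echo $reader->name, "\n";
    }
}

// read() returns false at the malformed element, with an error along the
// lines of "Opening and ending tag mismatch: ul and g:material".
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}

$reader->close();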
I could probably do something with DOMDocument::loadHTML, but the size of these files is not really compatible with a DOM approach.
Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DOMDocument to work?
Thanks.
If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.
$ tidy -output my.clean.xml my.xml
After that, the XML files are well-formed, so you can parse them using XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's content ends up inside the <body> element.
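As a rough sketch of the whole pipeline (assuming the tidy binary is installed and on the PATH, and using placeholder file names), you could run tidy from PHP and then stream the cleaned file:

<?php
// Sketch: clean the feed with the tidy CLI, then stream-parse the result.
// 'my.xml' and 'my.clean.xml' are placeholder names.
$in  = 'my.xml';
$out = 'my.clean.xml';

// tidy exits non-zero even when it only issues warnings,
// so check for the output file rather than the exit code.
shell_exec(sprintf('tidy -output %s %s', escapeshellarg($out), escapeshellarg($in)));

if (!is_file($out)) {
    exit("tidy did not produce an output file\n");
}

$reader = new XMLReader();
$reader->open($out);

while ($reader->read()) {
    // The original feed content now sits inside the <body> element
    // that tidy wrapped around it; XMLReader::name is the qualified name.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'g:material') {
        // readOuterXml() returns the whole (now well-formed) element as a
        // string, so it can be re-emitted wrapped in a CDATA section.
        echo $reader->readOuterXml(), "\n";
    }
}

$reader->close();

Since XMLReader only keeps the current node in memory, this stays flat on RAM even for files in the 1 GB range.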