
PHP - Read and repair big invalid XML files

Tags: php, xml, sax

I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <item>
    <title>Some article</title>
    <g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
  </item>
</rss>

Obviously, the </ul> closing tag is missing inside the g:material element. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add the missing CDATA section.
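To make the target concrete, the transformation for a single element might look like the sketch below. The helper name wrapMaterialInCdata is hypothetical, and it assumes the opening and closing g:material tags sit in the same in-memory string and that the content is not already in CDATA, so it says nothing about the file-size problem.

```php
<?php
// Hypothetical helper: wrap the raw content of each <g:material> element in CDATA.
// Assumption: both tags are present in $xml and the content is not yet CDATA.
function wrapMaterialInCdata(string $xml): string
{
    $result = preg_replace_callback(
        '~<g:material>(.*?)</g:material>~s',
        static function (array $m): string {
            // "]]>" may not appear inside CDATA, so split it across two sections.
            $safe = str_replace(']]>', ']]]]><![CDATA[>', $m[1]);
            return '<g:material><![CDATA[' . $safe . ']]></g:material>';
        },
        $xml
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $xml;
}

$line = '<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>';
echo wrapMaterialInCdata($line), PHP_EOL;
// prints: <g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li>]]></g:material>
```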

I tried a SAX parser to read this file, but it fails on the </g:material> tag because the </ul> tag is missing. I tried XMLReader and hit basically the same issue. I could probably do something with DOMDocument::loadHTML, but the size of these files is not really compatible with a DOM approach. Do you have any idea how I could repair this feed without having to buy lots of RAM for DOMDocument to work? Thanks.

Remi asked Mar 28 '13 10:03



1 Answer

If the files are too large to process in memory with PHP's Tidy extension, you can use the tidy command-line tool to make them parseable.

$ tidy -output my.clean.xml my.xml

After that, the files are well-formed, so you can parse them with XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's content ends up inside the <body> element.
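As a rough sketch of that second step, the cleaned file can be streamed with XMLReader so that only the current node is held in memory. The function name streamTitles is hypothetical, and the code assumes the cleaned file is well-formed and still contains title elements; it reports every element named title, wherever it appears in the document.

```php
<?php
// Hypothetical helper: yield the text of every <title> element, one at a time.
// Assumption: $path points to a well-formed file (e.g. the output of tidy).
function streamTitles(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        // read() advances node by node, so the whole document is never loaded.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->localName === 'title') {
                // readString() returns the text content of the current element.
                yield $reader->readString();
            }
        }
    } finally {
        $reader->close();
    }
}

// Usage, guarded so the script does nothing when the file is absent.
if (is_file('my.clean.xml')) {
    foreach (streamTitles('my.clean.xml') as $title) {
        echo $title, PHP_EOL;
    }
}
```

Using a generator keeps the caller's loop simple while the reader stays forward-only; anything that needs random access to the document would require a different approach.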

nibra answered Oct 25 '22 18:10
