 

Validating a large XML file ~400MB in PHP

I have a large XML file (around 400MB) that I need to ensure is well-formed before I start processing it.

The first thing I tried was something similar to the code below, which is great because I can find out whether the XML is not well-formed and which parts of it are 'bad':

// Use libxml's internal error collection so the failures can be inspected
libxml_use_internal_errors(true);

$doc = simplexml_load_string($xmlstr);
if ($doc === false) {
    $errors = libxml_get_errors();

    foreach ($errors as $error) {
        echo display_xml_error($error);
    }

    libxml_clear_errors();
}

Also tried...

$doc->load($tempFileName, LIBXML_DTDLOAD | LIBXML_DTDVALID);
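
For completeness, that call would sit in something like the snippet below (the surrounding DOMDocument setup is an assumption, not code from the original post):

libxml_use_internal_errors(true);

$doc = new DOMDocument();
// load() returns false on failure, and libxml collects the parse errors
if (!$doc->load($tempFileName, LIBXML_DTDLOAD | LIBXML_DTDVALID)) {
    foreach (libxml_get_errors() as $error) {
        printf("Line %d: %s\n", $error->line, trim($error->message));
    }
    libxml_clear_errors();
}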

I tested this with a file of about 60MB, but anything much larger (~400MB) causes something that is new to me, the "OOM killer", to kick in and terminate the script after what always seems like about 30 seconds.

I thought I might need to increase the memory available to the script, so I figured out the peak usage when processing the 60MB file and adjusted the limit accordingly for the larger one. I also turned the script time limit off, just in case that was the problem.

set_time_limit(0);
ini_set('memory_limit', '512M');
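
For reference, peak usage can be measured in-script with something like this (a rough sketch; the sample filename is just a placeholder):

// Parse the smaller file, then report how much memory was actually needed
$doc = simplexml_load_file('60mb-sample.xml');
printf("Peak memory: %.1f MB\n", memory_get_peak_usage(true) / 1048576);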

Unfortunately this didn't work, as the OOM killer appears to be a Linux kernel mechanism that kicks in when memory usage (is that even the right term?) stays consistently high.

It would be great if I could load the XML in chunks somehow, as I imagine this would reduce the memory load so that the OOM killer doesn't stick its fat nose in and kill my process.

Does anyone have any experience validating a large XML file and capturing errors about where it's badly formed? A lot of posts I've read point to SAX and XMLReader, which might solve my problem.
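
Something along these lines is what I have in mind with XMLReader (an untested sketch; the filename is just a placeholder):

// Stream the document node by node so memory stays low regardless of file size
libxml_use_internal_errors(true);

$reader = new XMLReader();
$reader->open('huge.xml');

while ($reader->read()) {
    // nothing to do per node; we only care whether reading succeeds
}
$reader->close();

foreach (libxml_get_errors() as $error) {
    printf("Line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();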

UPDATE: So @chiborg pretty much solved this issue for me. The only downside to this method is that I don't get to see all of the errors in the file, just the first one that failed, which I guess makes sense as the parser can't continue past the first point that fails.

When using simplexml, it was able to capture most of the issues in the file and show them to me at the end, which was nice.

asked Dec 13 '12 by Carlton

1 Answer

Since the SimpleXML and DOM APIs will always load the document into memory, using a streaming parser like SAX or XMLReader is the better approach.

Adapting the code from the example page, it could look like this:

$xml_parser = xml_parser_create();
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

$errors = array();

// Feed the file to the parser in small chunks so memory use stays low
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        $errors[] = array(
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser));
        break; // the parser cannot recover after a well-formedness error
    }
}
fclose($fp);
xml_parser_free($xml_parser);
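
To actually see where the document broke, the collected errors can then be printed, for example (a small usage sketch, not part of the original answer):

// Report the collected error(s) with their line numbers
foreach ($errors as $error) {
    list($message, $line) = $error;
    echo "Error on line $line: $message\n";
}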
answered Oct 12 '22 by chiborg