How can I split a big XML file into smaller ones with PHP?

Tags: php, split, xml

I have an XML file with over 300,000 entries which my script has to parse every day.

The XML has the following structure:

<root>
    <item>
        <proper1></proper1>
        <proper2></proper2>
    </item>
</root>

I need to split the big XML file into smaller files so my PHP script can handle them; currently it can't process the file because it uses too much memory. Can anyone help me with that?

Svetoslav asked Dec 04 '22


1 Answer

Much would depend on your XML file structure.

For example, you could do something like this (assuming the structure is the one you posted, line breaks included; otherwise things get more complicated):

Line-chopping version: blazingly fast at slicing a large XML file that is formatted "just right", crashing and burning if it is not.

// XMLFILE holds the path of the big file, e.g.: define('XMLFILE', 'bigfile.xml');
$fp = fopen(XMLFILE, 'r');
$decl = fgets($fp, 1024);  // Save the XML declaration '<?xml...?>'
$root = fgets($fp, 1024);  // Save the opening <root> tag
$n = 1;
while (!feof($fp)) {
    $line = fgets($fp, 1024);
    $tag  = trim($line);   // fgets() keeps indentation and the newline
    if ('<item>' === $tag) {
        // No chunk may be open at this point.
        !isset($gp) || trigger_error('Unexpected state', E_USER_ERROR);
        $gp = fopen("chunk{$n}.xml", 'w'); $n++;
        // Replay the header we saved from the big file
        fwrite($gp, $decl);
        fwrite($gp, $root);
    } else if ('</item>' === $tag) {
        fwrite($gp, $line);
        fwrite($gp, '</root>');
        fclose($gp); unset($gp);
        continue;
    }
    if (!isset($gp)) {
        if ('</root>' === $tag /* EOF */) {
            break;
        } else {
            trigger_error('Unexpected state 2', E_USER_ERROR);
        }
    }
    fwrite($gp, $line);
}
fclose($fp);
// Every chunk must be closed by now.
!isset($gp) || trigger_error('Unexpected state 3', E_USER_ERROR);

This has the major benefit of letting you 'recycle' your existing XML parsing script: indeed, you could call it as soon as you close $gp, or better still, not write to any file at all but queue the writes into a string buffer and call the script with that buffer.
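
Here is a minimal sketch of that buffering idea, reusing $fp, $decl and $root from the loop above; processItems() is a hypothetical stand-in for your existing script's entry point, not a real function:

// Buffering variant of the loop above: collect each item in a string and
// hand the small in-memory document straight to the existing script.
// processItems() is a hypothetical stand-in for your parsing entry point.
$buffer = null;
while (!feof($fp)) {
    $line = fgets($fp, 1024);
    $tag  = trim($line);
    if ('<item>' === $tag) {
        $buffer = $decl . $root;            // replay the saved header
    }
    if (null !== $buffer) {
        $buffer .= $line;                   // includes the <item>/</item> lines
    }
    if ('</item>' === $tag) {
        processItems($buffer . '</root>');  // a complete, tiny XML document
        $buffer = null;
    }
}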

Another advantage is being able to 'outsource' the files among different subservers, for example if XML processing takes long because of DNS resolution, DB calls, HTTP/SOAP calls, the need for feedback, and so on. In that case you could save the files in different subdirectories based on ($n % NUM_CLIENTS), and each client could fetch one file at a time, process it, delete it, and move on.
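
As a sketch of that idea (NUM_CLIENTS and the worker-NN directory layout are assumptions for illustration, not part of the code above), the fopen() call in the chunker could become:

// Hypothetical sharding: route chunk $n to one of NUM_CLIENTS worker directories.
define('NUM_CLIENTS', 4);                        // assumed number of workers
$dir = sprintf('worker-%02d', $n % NUM_CLIENTS); // worker-00 .. worker-03
is_dir($dir) || mkdir($dir, 0755, true);
$gp = fopen("{$dir}/chunk{$n}.xml", 'w');
// Each worker then polls its own directory: fetch a file, process it, delete it.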

Yet, the best way to proceed would be to rewrite your script so that it does not load the whole XML into memory, but parses it a little at a time, using PHP's XML Parser support.

A compromise is to use XML Parser to slice the XML file and feed the slices to your existing script, "as is".

XMLparse version: efficiently slicing and dicing a large XML file without worrying about how the XML is actually put together.

The xmlparse functions work through callbacks: you feed your data to an entry point (xml_parse), which analyzes it, splits it, and routes the various chunks to the appropriate subfunctions, which you define. xml_parse deals with encoding and whitespace, freeing you from the need to cope with them yourself, which is one of the greatest drawbacks of the code above. The xmlparse core itself keeps no data, so we can achieve a constant-memory implementation even for gigabyte (or terabyte) files.

So let us see how to rewrite the code for XML Parser, chunking the big file by splitting off a fixed number of repetitions of a given tag.

I.e., input file:

<root><item>(STUFF OF ITEM1)</item><item>(STUFF OF ITEM2)</item>....ITEM1234...</root>

output files:

 FILE1: <root><item>(1)</item><item>(2)</item>...(5)</root>
 FILE2: <root><item>(6)</item><item>(7)</item>...(10)</root>
 ...

We do this by writing an XML parser that extracts each "chunk" of N (here N = 5) items and feeds it to a chunk processor, which wraps the chunk between <root> tags, adds an XML header, and thus produces a file with the same syntax as the original big file, but holding only five items.

To save the chunks in separate files, we keep track of the chunk number.

    function processChunk($lastChunk = false) {
        GLOBAL $CHUNKS, $PAYLOAD, $ITEMCOUNT;
        if ('' == $PAYLOAD) {
            return;
        }
        $xp = fopen($file = "output-{$CHUNKS}.xml", 'w');
        fwrite($xp, '<?xml version="1.0"?>' . "\n");
        fwrite($xp, '<root>');
        fwrite($xp, $PAYLOAD);
        // The last chunk's payload already ends with the big file's own
        // </root>, fed to us by the parser, so do not write a second one.
        $lastChunk || fwrite($xp, '</root>');
        fclose($xp);
        print "Written {$file}\n";
        $CHUNKS++;
        $PAYLOAD   = '';
        $ITEMCOUNT = 0;
    }

The xmlparse functions require callbacks: one receives tag OPENINGs, one tag CLOSINGs, one gets the content, and another gets the whatever. We're not interested in the whatever, so we only fill in the first three handlers.

    function startElement($xml, $tag, $attrs = array()) {
        GLOBAL $PAYLOAD, $CHUNKS, $ITEMCOUNT, $CHUNKON;
        if (!($CHUNKS || $ITEMCOUNT)) {
            // Very first item: discard whatever preceded it
            // (the XML declaration and the opening <root> tag).
            if ($CHUNKON == strtolower($tag)) {
                $PAYLOAD = '';
            }
        }
        $PAYLOAD .= "<{$tag}";
        foreach ($attrs as $k => $v) {
            // htmlspecialchars(), not addslashes(): backslash escapes are not valid XML.
            $PAYLOAD .= " {$k}=\"" . htmlspecialchars($v, ENT_QUOTES) . '"';
        }
        $PAYLOAD .= '>';
    }

    function endElement($xml, $tag) {
        GLOBAL $PAYLOAD, $CHUNKON, $ITEMCOUNT, $ITEMLIMIT;
        // Append the closing tag directly: it must not be re-escaped.
        $PAYLOAD .= "</{$tag}>";
        if ($CHUNKON == strtolower($tag)) {
             if (++$ITEMCOUNT >= $ITEMLIMIT) {
                 processChunk();
             }
        }
    }

    function dataHandler($xml, $data) {
        GLOBAL $PAYLOAD;
        // The parser hands us decoded character data; re-escape it on the way out.
        $PAYLOAD .= htmlspecialchars($data, ENT_QUOTES);
    }

    function defaultHandler($xml, $data) {
        // a.k.a. Wild Text Fallback Handler, or WTFHandler for short.
    }

The createXMLParser function is standalone for clarity:

    function createXMLParser($CHARSET, $bareXML = false) {
        $CURRXML = xml_parser_create($CHARSET);
        xml_parser_set_option($CURRXML, XML_OPTION_CASE_FOLDING, false);
        xml_parser_set_option($CURRXML, XML_OPTION_TARGET_ENCODING, $CHARSET);
        xml_set_element_handler($CURRXML, 'startElement', 'endElement');
        xml_set_character_data_handler($CURRXML, 'dataHandler');
        xml_set_default_handler($CURRXML, 'defaultHandler');
        if ($bareXML) {
            xml_parse($CURRXML, '<?xml version="1.0"?>', 0);
        }
        return $CURRXML;
    }

Finally the feeding loop, which opens Mr. Big File and sends it to the grinder.

    function chunkXMLBigFile($file, $tag = 'item', $howmany = 5) {
         GLOBAL $CHUNKON, $CHUNKS, $ITEMLIMIT, $ITEMCOUNT, $PAYLOAD;

         // Every chunk only holds $ITEMLIMIT "$CHUNKON" elements at most.
         $CHUNKON   = $tag;
         $ITEMLIMIT = $howmany;
         $CHUNKS    = 0;
         $ITEMCOUNT = 0;
         $PAYLOAD   = '';

         $xml = createXMLParser('UTF-8', false);

         $fp = fopen($file, 'r');
         while (!feof($fp)) {
              $chunk = fgets($fp, 10240);
              xml_parse($xml, $chunk, feof($fp));
         }
         fclose($fp);
         xml_parser_free($xml);

         // Now, it is possible that one last chunk is still queued for processing.
         processChunk(true);
    }

Then we call the machine: "split test.xml into chunks of five instances of the item tag":

    chunkXMLBigFile('test.xml', 'item', 5);
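
Once the chunks exist, each one is small enough for a conventional in-memory load. A possible consumer, sketched here with SimpleXML (the per-item logic is whatever your current script already does):

    foreach (glob('output-*.xml') as $chunkFile) {
        // Each chunk now fits comfortably in memory.
        $doc = simplexml_load_file($chunkFile);
        foreach ($doc->item as $item) {
            // ... run the existing per-item processing here ...
        }
        unlink($chunkFile); // optional: remove the chunk once processed
    }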

This implementation runs about five times slower than the stupid chunker at the beginning, but it can deal with several tags on the same line, and it can even be extended to validate the XML.

LSerni answered Dec 06 '22