
Parsing Huge XML Files in PHP

I'm trying to parse the DMOZ content and structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?

Ian asked May 26 '09 16:05




1 Answer

There are only two PHP APIs that are really suited for processing large files. The first is the old expat API (the xml_parser_* functions), and the second is the newer XMLReader class. Both read the document as a continuous stream rather than loading the entire tree into memory, which is what SimpleXML and DOM do.

As an example, here is a partial expat-based parser for the DMOZ catalog. It prints the URL of every link listed under the Top/Home/Consumer_Information/Electronics/ category:

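<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        // Case folding is on by default, so element and attribute names
        // reach the handlers upper-cased ("TOPIC", "R:ID", ...).
        $this->_parser = xml_parser_create("UTF-8");

        // Array callables bind the handlers to this object
        // (xml_set_object() is deprecated as of PHP 8.4).
        xml_set_element_handler(
            $this->_parser,
            array($this, "startTag"),
            array($this, "endTag")
        );
    }

    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        // Remember the ID of the topic we are currently inside.
        if ($name == "TOPIC" && isset($attribs["R:ID"])) {
            $this->_currentId = $attribs["R:ID"];
        }

        // Print links that belong to the category of interest.
        if ($name == "LINK" && isset($attribs["R:RESOURCE"])
            && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Could not open {$this->_file}\n");
        }

        // Feed the file to the parser in 4 KB chunks so memory use stays flat.
        while (!feof($fh)) {
            $data = fread($fh, 4096);
            if (!xml_parse($this->_parser, $data, feof($fh))) {
                die(sprintf(
                    "XML error: %s at line %d\n",
                    xml_error_string(xml_get_error_code($this->_parser)),
                    xml_get_current_line_number($this->_parser)
                ));
            }
        }

        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();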
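The same extraction can also be sketched with XMLReader, the second API mentioned above. This is only a rough sketch under some assumptions: the function name dmozLinks is made up for illustration, and the element and attribute names (Topic, link, r:id, r:resource) are inferred from the upper-cased names the expat parser sees, so check them against your copy of the dump. XMLReader does not fold case, so the names have to match the file exactly.

```php
<?php

/**
 * Hypothetical helper: lazily yields the r:resource URL of every <link>
 * found inside a <Topic> whose r:id starts with $prefix.
 * Element and attribute names are assumptions about the DMOZ dump.
 */
function dmozLinks(string $file, string $prefix): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Could not open $file");
    }

    $currentId = "";

    try {
        // read() advances one node at a time; the tree is never built.
        while ($reader->read()) {
            if ($reader->nodeType !== XMLReader::ELEMENT) {
                continue;
            }

            if ($reader->name === "Topic") {
                // getAttribute() returns null when the attribute is missing.
                $id = $reader->getAttribute("r:id");
                if ($id !== null) {
                    $currentId = $id;
                }
            } elseif ($reader->name === "link" && strpos($currentId, $prefix) === 0) {
                $url = $reader->getAttribute("r:resource");
                if ($url !== null) {
                    yield $url;
                }
            }
        }
    } finally {
        $reader->close();
    }
}
```

It could be used like this: foreach (dmozLinks("content.rdf.u8", "Top/Home/Consumer_Information/Electronics/") as $url) { echo $url . "\n"; }. Because the function is a generator, only one node is held in memory at a time, and the loop body is a natural place to batch rows for a MySQL insert.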
Emil H answered Sep 20 '22 03:09