I have an XML file that contains multiple XML declarations, like the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node>
<element1>Stefan</element1>
<element2>42</element2>
<element3>Shirt</element3>
<element4>3000</element4>
</node>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node>
<element1>Damon</element1>
<element2>32</element2>
<element3>Jeans</element3>
<element4>4000</element4>
</node>
</root>
When I tried to load the XML with
$data = simplexml_load_file("testdoc.xml") or die("Error: Cannot create object");
it gave me the following error:
Warning: simplexml_load_file(): testdoc.xml:11: parser error : XML declaration allowed only at the start of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <?xml version="1.0" encoding="UTF-8"?> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): testdoc.xml:12: parser error : Extra content at the end of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <root> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Error: Cannot create object
Please let me know how to parse this XML, or how to split it into a number of separate XML files that I can read. The file is around 1 GB.
The second declaration
<?xml version="1.0" encoding="UTF-8"?>
needs to be removed. Only one XML declaration is allowed in a file, and it must come at the very start.
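To see where the extra declarations sit before changing anything, a quick check along these lines may help. This is only a sketch: the function name list_xml_declarations is made up for illustration, and it assumes each declaration sits on a line of its own, as in the sample in the question.

```shell
#!/bin/sh
# Hypothetical helper: print "LINE:TEXT" for every XML declaration in a file.
# "?" is a literal character in grep's basic regex, so it needs no escaping.
# Any hit reported beyond line 1 is one the parser will reject.
list_xml_declarations() {
    grep -n '<?xml ' "$1"
}
```

Called as list_xml_declarations testdoc.xml, it should report one hit per concatenated document.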
Strictly speaking, you also need a single root element (though I've seen lenient parsers). Just wrap the contents in a synthetic tag, so that your file looks like:
<?xml version="1.0" encoding="UTF-8"?>
<metaroot><!-- synthetic unique root, no semantics attached -->
<root>
<!-- ... -->
</root>
<root>
<!-- ... -->
</root>
<!-- ... -->
</metaroot>
Solution for (very) large files:
Use sed to eliminate offending xml declarations and printf to add a single xml declaration plus a unique root element. A sequence of bash commands follows:
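printf "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<metaroot>\n" >out.xml
sed '/<?xml /d' in.xml >>out.xml
printf "\n</metaroot>\n" >>out.xml
in.xml denotes your original file, out.xml the purged result.
printf prints the single XML declaration and the opening and closing tags of the synthetic root.
sed edits a file line by line, performing actions on lines that match a regex pattern. The pattern here is the start of the XML declaration (<?xml ), and the action is to delete the matching line. The ? is deliberately left unescaped: it is a literal character in sed's basic regular expressions, whereas GNU sed reads \? as "zero or one of the preceding character", which would make the pattern match every line containing "xml ".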
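The three commands above can also be bundled into one function, so the output is written in a single redirection. This is a sketch under assumptions: the name wrap_xml_docs is made up, "metaroot" is the synthetic root used above, and every declaration is assumed to sit on a line of its own. The result is only well-formed if each concatenated document was well-formed to begin with.

```shell
#!/bin/sh
# Hypothetical helper: merge concatenated XML documents under one synthetic root.
# $1 = input file, $2 = output file.
wrap_xml_docs() {
    src=$1
    dst=$2
    {
        # Single declaration plus the opening tag of the synthetic root.
        printf '<?xml version="1.0" encoding="UTF-8"?>\n<metaroot>\n'
        # Drop every line holding a declaration; "?" is literal in a basic regex.
        sed '/<?xml /d' "$src"
        # Leading newline guards against an input without a final line break.
        printf '\n</metaroot>\n'
    } > "$dst"
}
```

Usage would be wrap_xml_docs in.xml out.xml. sed streams the input, so memory use should stay flat even for a file of around 1 GB.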
Notes:
sed is also available for Windows and macOS.
Another option is to split the file into individual well-formed files (taken from this SO answer):
csplit -z -f 'temp' -b 'out%03d.xml' in.xml '/<\?xml /' {*}
which produces files named tempout000.xml, tempout001.xml, ... (the -f prefix followed by the -b suffix).
To be safe with the autonumbering, you should know at least the order of magnitude of the number of documents that were concatenated into your input file, so that the counter has enough digits. If you do not know it, you can fall back on the input file's size in bytes as an upper bound, for example with -b 'out%09d.xml' in the above command.
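If the number of documents is unknown, it can be counted up front and used to size the counter. The sketch below assumes GNU csplit (the -z option and the {*} repeat are GNU extensions) and one declaration per document, each on a line of its own. The function name split_xml_docs and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: split concatenated XML documents into numbered files.
# $1 = input file, $2 = output prefix (for example "out").
split_xml_docs() {
    src=$1
    prefix=$2
    # One declaration per document, so this is the number of output files.
    count=$(grep -c '<?xml ' "$src")
    # Pad the counter to as many digits as the count itself has.
    digits=${#count}
    # -s silences the per-file byte counts, -z drops empty pieces.
    csplit -s -z -f "$prefix" -b "%0${digits}d.xml" "$src" '/<?xml /' '{*}'
}
```

For example, split_xml_docs in.xml out writes files whose names are the prefix, a zero-padded number, and .xml. Counting first costs one extra pass over the input, which is usually acceptable next to the split itself.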
This is not valid XML. You will need to use string functions to split it or, more precisely, to read it part by part.
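$filename = 'testdoc.xml'; // the concatenated file from the question
$xmlDeclaration = '<?xml version="1.0" encoding="UTF-8"?>';

// Read line by line so the whole file is never held in memory at once.
$file = new SplFileObject($filename, 'r');
// SKIP_EMPTY only works as expected together with READ_AHEAD.
$file->setFlags(SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY);

$buffer = '';
foreach ($file as $line) {
    if (false === strpos($line, $xmlDeclaration)) {
        // Still inside the current document: keep collecting lines.
        $buffer .= $line;
    } else {
        // A new declaration starts the next document: flush the previous one.
        outputBuffer($buffer);
        $buffer = $line;
    }
}
// Flush the last document.
outputBuffer($buffer);

// Parse one complete document and print the text of its first element1 node.
function outputBuffer($buffer) {
    if ($buffer !== '') {
        $dom = new DOMDocument();
        $dom->loadXML($buffer);
        $xpath = new DOMXPath($dom);
        echo $xpath->evaluate('string(//element1)'), "\n";
    }
}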
Output:
Stefan
Damon