I have an XML file which contains multiple XML declarations, like the following:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node>
<element1>Stefan</element1>
<element2>42</element2>
<element3>Shirt</element3>
<element4>3000</element4>
</node>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node>
<element1>Damon</element1>
<element2>32</element2>
<element3>Jeans</element3>
<element4>4000</element4>
</node>
</root>
When I tried to load the XML with
$data = simplexml_load_file("testdoc.xml") or die("Error: Cannot create object");
it gives me the following errors:
Warning: simplexml_load_file(): testdoc.xml:11: parser error : XML declaration allowed only at the start of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <?xml version="1.0" encoding="UTF-8"?> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): testdoc.xml:12: parser error : Extra content at the end of the document in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): <root> in C:\xampp\htdocs\crea\services\testxml.php on line 3
Warning: simplexml_load_file(): ^ in C:\xampp\htdocs\crea\services\testxml.php on line 3
Error: Cannot create object
Please let me know how to parse this XML, or how to split it into a number of separate XML files that I can read. The file size is around 1 GB.
The second declaration (line 11 of the file, as the parser error reports)
<?xml version="1.0" encoding="UTF-8"?>
needs to be removed, along with any further ones. Only one XML declaration is allowed per document, and it must come at the very start of the file.
Strictly speaking, you also need to have a single root element (though I've seen lenient parsers). Just wrap the contents in a synthetic tag, so that your file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<metaroot><!-- synthetic unique root, no semantics attached -->
<root>
<!-- ... -->
</root>
<root>
<!-- ... -->
</root>
<!-- ... -->
</metaroot>
Solution for (very) large files:
Use sed to eliminate the offending XML declarations and printf to add a single XML declaration plus a unique root element. A sequence of bash commands follows:
printf "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<metaroot>\n" >out.xml
sed '/<[?]xml /d' in.xml >>out.xml
printf "\n</metaroot>\n" >>out.xml
in.xml denotes your original file, out.xml the purged result.
printf prints the single XML declaration and the opening/closing tags.
sed is a tool that edits a file line by line, performing actions contingent on regex pattern matches. The pattern to match is the start of the XML declaration (<[?]xml ), and the action to perform is to delete that line. The question mark is written as [?] because GNU sed treats \? as a "zero or one" quantifier, so <\?xml would also match any line that merely contains "xml ".
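To try the three commands on something smaller than the real file first, here is a minimal sketch. The two-document sample and the scratch directory are assumptions made for illustration; the declaration pattern is written as <[?]xml , a bracket expression that matches a literal question mark in any sed or grep.

```shell
# Work in a scratch directory so no existing in.xml/out.xml is overwritten.
cd "$(mktemp -d)" || exit 1

# Small two-document sample (hypothetical stand-in for the real 1 GB file).
cat >in.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node><element1>Stefan</element1></node>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node><element1>Damon</element1></node>
</root>
EOF

# The same three steps: one declaration plus synthetic root,
# then the input with every declaration line dropped, then the closing tag.
printf '<?xml version="1.0" encoding="UTF-8"?>\n<metaroot>\n' >out.xml
sed '/<[?]xml /d' in.xml >>out.xml
printf '\n</metaroot>\n' >>out.xml

# Sanity checks: one declaration should remain and both <root> elements survive.
decl_count=$(grep -c '<[?]xml ' out.xml)
root_count=$(grep -c '^<root>' out.xml)
echo "declarations: $decl_count, root elements: $root_count"
```

Note that sed deletes whole lines, so this only works as intended if every declaration sits on a line of its own, as it does in the question's sample.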
Notes:
sed is available for Windows/macOS too.
Another option is to split the file into individual well-formed files (taken from this SO answer):
csplit -z -f 'out' -b '%03d.xml' in.xml '/<[?]xml /' '{*}'
which produces files named out000.xml, out001.xml, ...
You should know at least the order of magnitude of the number of individual files that were concatenated into your input file, so that the zero-padded numbering is wide enough (though you could of course take the input file's size in bytes as an upper bound, using -b '%09d.xml' in the above command).
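If csplit is not an option (as far as I know, -b, -z and {*} are GNU extensions), the same split can be sketched with awk. This is an alternative technique, not the command from the linked answer; the part%06d.xml naming and the small sample input are assumptions made for illustration.

```shell
# Work in a scratch directory so the part*.xml glob only sees new files.
cd "$(mktemp -d)" || exit 1

# Small two-document sample (hypothetical stand-in for the real file).
cat >in.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node><element1>Stefan</element1></node>
</root>
<?xml version="1.0" encoding="UTF-8"?>
<root>
<node><element1>Damon</element1></node>
</root>
EOF

# Start a new output file at every declaration line and copy lines into it.
# close() releases the previous file so a long input does not exhaust
# the number of files awk may keep open at once.
awk '
  /<[?]xml / { if (out) close(out); out = sprintf("part%06d.xml", n++) }
  out        { print > out }
' in.xml

# Count the pieces that were written.
set -- part*.xml
piece_count=$#
echo "pieces: $piece_count"
```

Each piece keeps its own declaration and single root element, so it can be loaded on its own, for instance with simplexml_load_file. Lines before the first declaration, if any, are skipped.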
This is not valid XML. You will need to use string functions to split it or, to be more exact, to read it part by part.
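$filename = 'testdoc.xml'; // the concatenated file from the question
$xmlDeclaration = '<?xml version="1.0" encoding="UTF-8"?>';
$file = new SplFileObject($filename, 'r');
$file->setFlags(SplFileObject::SKIP_EMPTY);
$buffer = '';
// Collect lines until the next XML declaration, then parse what was collected.
foreach ($file as $line) {
    if (false === strpos($line, $xmlDeclaration)) {
        $buffer .= $line;
    } else {
        outputBuffer($buffer);
        $buffer = $line;
    }
}
// Parse the last document, which is not followed by another declaration.
outputBuffer($buffer);

// Parses one complete document and prints the text of its first <element1>.
function outputBuffer($buffer) {
    if (!empty($buffer)) {
        $dom = new DOMDocument();
        $dom->loadXML($buffer);
        $xpath = new DOMXPath($dom);
        echo $xpath->evaluate('string(//element1)'), "\n";
    }
}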
Output:
Stefan
Damon