When you open up Word, it allows you to save as Word Open XML format. I've seen posts regarding opening up the docx file as a zip and then extracting stuff from there. But what I really want is a way to turn the docx into a single XML exactly like when doing the "save as" action in MS Office. What to do?
And how to do this for the .doc format ?
Note: I would like to do this programmatically. Preferably under Linux development conditions with PHP. But if that's not available, then other languages will do. Lastly, if it comes down to it, I can consider spinning up a Windows server to do this.
Sorry to resuscitate a dead thread, but I just found an answer for the DOCX files. A DOCX file is just a ZIP archive of XML files. So for extracting the contents of one of its file, v.gr. word/document.xml under a Linux environment, you have to run unzip:
unzip -q -c myfile.docx word/document.xml
For catching the output of this command into the $xml variable of a PHP script, you can issue:
$xml = shell_exec ("unzip -q -c myfile.docx word/document.xml");
Hoping this answer helps for DOCX files. Better late than never.
For DOC files, this method does not work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With