
How to convert Doc/Docx into a single XML file automatically?

Word lets you save a document in its single-file Word XML format. I've seen posts about opening the docx file as a zip archive and extracting parts from it. But what I really want is a way to turn the docx into a single XML file, exactly like the "Save As" action in MS Office does. How can I do this?

And how can I do this for the .doc format?

Note: I would like to do this programmatically, preferably on Linux with PHP. If that's not possible, other languages will do. As a last resort, I can consider spinning up a Windows server to do this.

samxli asked Aug 13 '12 10:08



1 Answer

Sorry to resurrect a dead thread, but I just found an answer for DOCX files. A DOCX file is just a ZIP archive of XML files. So to extract the contents of one of its files, e.g. word/document.xml (the main document body), on Linux, you can run unzip:

unzip -q -c myfile.docx word/document.xml

To capture the output of this command in the $xml variable of a PHP script, you can use:

$xml = shell_exec ("unzip -q -c myfile.docx word/document.xml");
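If shelling out to unzip is not an option, the same extraction can be sketched with a ZIP library instead. The following is a minimal Python sketch, not part of the original answer: the helper name read_docx_part is hypothetical, and it assumes the part is UTF-8 encoded, which is typical for word/document.xml but not guaranteed.

```python
import zipfile


def read_docx_part(docx_path, part="word/document.xml"):
    """Return one part of a DOCX (ZIP) archive as a string.

    Hypothetical helper for illustration; assumes the part is UTF-8 encoded.
    """
    with zipfile.ZipFile(docx_path) as archive:
        # ZipFile.read() raises KeyError if the part is not in the archive.
        return archive.read(part).decode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage: python read_docx_part.py myfile.docx
    print(read_docx_part(sys.argv[1]))
```

Like the unzip command, this returns only the requested part, not a single-file XML package of the whole document.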

I hope this helps for DOCX files. Better late than never.

For DOC files, this method does not work, because the older .doc format is a binary format rather than a ZIP archive.

Pierre François answered Sep 18 '22 06:09
