
Action module to convert Word and PowerPoint into XML using CPF

Is there any way to convert MS Word and PowerPoint data and metadata into XML using the pipeline feature of CPF?

Thanks in advance

Saahil Gupta asked Feb 08 '23 02:02


2 Answers

There are already pipelines to handle processing the zipped XML form of MS Office. Attach the pipelines "Office OpenXML Extract" and "WordprocessingML Process" to your domain. You won't get the full upconversion to DocBook that you would from the binary (.doc) MS Word docs, but we do tidy up the XML somewhat and you can add your own transforms onto the end.

mholstege answered May 09 '23 21:05


The short answer is yes, you can convert to XML.

The longer answer is that it depends on the version. Any version from Word 2007 onward already stores documents in an XML format: the file is a zip archive containing several XML documents. The same is true for PowerPoint. That XML can be tricky to work with, and you will most likely want to convert it to a cleaner form.

Also, the latest version of Word has a new schema, so the format of the XML will be different.

You could start by seeing what xdmp:word-convert will give you. If that doesn't work well enough, you could write your own conversion using xdmp:zip-get. Since the Word file itself is a zip file, you can call that function on it to learn how the .docx is put together and decide how it should be converted.
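To get a feel for how the package is put together before writing anything in MarkLogic, here is a minimal sketch in Python (not XQuery, and not any MarkLogic API). It builds a tiny in-memory zip laid out like a .docx, then lists its parts and reads the body text and core metadata. The helper names and the sample content are hypothetical; a real Word file contains many more parts than the two shown here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the main document part and the core-properties part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
DC_NS = "http://purl.org/dc/elements/1.1/"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"


def build_sample_package() -> bytes:
    """Build a tiny zip laid out like a .docx (illustrative stand-in only)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    core_xml = (
        f'<cp:coreProperties xmlns:cp="{CP_NS}" xmlns:dc="{DC_NS}">'
        "<dc:title>Sample</dc:title><dc:creator>Example Author</dc:creator>"
        "</cp:coreProperties>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document_xml)
        zf.writestr("docProps/core.xml", core_xml)
    return buf.getvalue()


def list_parts(data: bytes) -> list[str]:
    """Return the names of the parts stored in the zip, in archive order."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def paragraphs(data: bytes) -> list[str]:
    """Join the text runs of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W_NS}}}p"):
        result.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return result


def core_metadata(data: bytes) -> dict[str, str]:
    """Map each core-property element's local name to its text."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {child.tag.split("}")[1]: (child.text or "") for child in root}


if __name__ == "__main__":
    data = build_sample_package()
    print(list_parts(data))
    print(paragraphs(data))
    print(core_metadata(data))
```

To inspect a real file, read its bytes from disk and pass them to the same functions. The same exploration inside MarkLogic would pull individual parts out with xdmp:zip-get, as described above.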

For this to work with CPF, you will have to write your own action module and configure the CPF pipeline to include it as a step.

Tyler Replogle answered May 09 '23 20:05