
How do we convert a Microsoft Word DOCX file to HTML in XSLT?

I have a project about transforming Word DOCX (OOXML) files to HTML.

I use XMLSpy with XSLT and XPath for this transformation.

Suppose I write an XSLT program for a single Word file and transform it. My supervisor says that if I change a value in the file, that approach won't work.

I agree, because I wrote the code specifically for that document, knowing what it contains.

But how do we write general XSLT code that transforms any Word file into a well-formed HTML document (since Word documents can differ a lot from one another)?

Is the problem that I am trying to do this with XSLT? Is something wrong with that approach, or am I just overcomplicating things?

Sojimanatsu asked Mar 21 '17 13:03



1 Answer

Your plan to use XSLT to transform DOCX files to HTML is fundamentally sound. XSLT is ideal for this purpose as it is well suited for mapping from XML to XML (or (X)HTML).

Your challenge will be that the XML underlying DOCX is complex. ECMA-376 Part 1 (Office Open XML: Fundamentals and Markup Language Reference) alone is over 5,000 pages long. If you know XML, XML namespaces, XSLT, HTML, and CSS well, you'll "just" have to learn some basics of OOXML to get started.

The concern about changing a value won't matter if you do this robustly and fundamentally understand OOXML. Start with the notion of runs of text in paragraphs: text (w:t) sits inside runs (w:r), which sit inside paragraphs (w:p).
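
As a rough illustration of that starting point, here is a minimal XSLT 1.0 sketch. It assumes the input is word/document.xml, extracted from the .docx ZIP package, and it handles only paragraphs, runs, text, line breaks, and directly applied bold. Styles, numbering, tables, images, and everything else are deliberately ignored, so treat it as a skeleton to grow, not a complete converter. Because the templates match element names and never depend on particular content, editing the text of the Word file does not break the transform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    exclude-result-prefixes="w">

  <xsl:output method="html" encoding="UTF-8" indent="no"/>

  <!-- Entry point: wrap the document body in an HTML skeleton. -->
  <xsl:template match="/">
    <html>
      <head><title>Converted document</title></head>
      <body>
        <xsl:apply-templates select="w:document/w:body"/>
      </body>
    </html>
  </xsl:template>

  <!-- Each Word paragraph becomes an HTML paragraph. -->
  <xsl:template match="w:p">
    <p><xsl:apply-templates/></p>
  </xsl:template>

  <!-- A run carries formatting; only direct bold (w:b) is mapped here.
       Bold inherited from a style is NOT detected by this sketch. -->
  <xsl:template match="w:r">
    <xsl:choose>
      <xsl:when test="w:rPr/w:b[not(@w:val = 'false' or @w:val = '0' or @w:val = 'off')]">
        <b><xsl:apply-templates/></b>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- The actual text lives in w:t. -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Explicit line break inside a run. -->
  <xsl:template match="w:br">
    <br/>
  </xsl:template>

  <!-- Suppress all other text (field codes, deleted text, and so on).
       Elements without a template fall through to the built-in rule,
       which simply recurses into their children. -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

The design choice worth noting is the last template: unknown elements are traversed rather than rejected, and only w:t contributes text, so a document containing constructs the stylesheet does not know about still yields well-formed HTML, just with less formatting. Support for further features can then be added one template at a time.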

Eric White has written extensively on OOXML in general and even transforming it to HTML specifically. See Transforming Open XML WordprocessingML to XHtml for excellent articles and examples.

kjhughes answered Sep 30 '22 13:09
