
Is there a glossary of Word .docx XML tags?

Tags: xml, glossary, docx

I'm trying to create a parser to find the tracked changes and author of a Word .docx file...

I found document.xml, but there are so many tags! Is there a glossary somewhere of what all those tags stand for?

I'd like to avoid brute forcing my way through this if possible.

asked Oct 12 '17 16:10 by fifamaniac04




1 Answer

The "Office Open XML" format and its XML vocabularies are described in detail in the ECMA-376 standard: http://www.ecma-international.org/publications/standards/Ecma-376.htm

To give you an idea, the following XSLT extracts just the effective text, with tracked deletions left out, from a WordprocessingML document such as the word/document.xml part stored inside a .docx file (which is a ZIP archive).

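<!-- Output the text of w:t elements; drop tracked deletions (w:delText)
     and any other text content -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="text"/>
  <!-- Visible text lives in w:t -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
  <!-- Text removed by a tracked deletion -->
  <xsl:template match="w:delText"/>
  <!-- Suppress text from other elements (e.g. field codes) that the
       built-in text rule would otherwise copy to the output -->
  <xsl:template match="text()"/>
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>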

To extract the changes themselves instead, your application would also have to handle w:ins elements (tracked insertions), alongside the w:del elements that wrap w:delText.
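
As a rough illustration of that, here is a minimal Python sketch using only the standard library. It assumes the change metadata you want is in the w:author and w:date attributes of w:ins and w:del, and it ignores moves and formatting changes. The function name tracked_changes and the dictionary keys are made up for this example, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the URI the stylesheet binds to "w").
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Revision element -> (label, element that holds its text).
# The labels "insert" and "delete" are arbitrary names for this sketch.
REVISIONS = {
    W + "ins": ("insert", W + "t"),
    W + "del": ("delete", W + "delText"),
}


def tracked_changes(docx_file):
    """Return tracked insertions and deletions as a list of dicts.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    changes = []
    for element in root.iter():
        if element.tag not in REVISIONS:
            continue
        kind, text_tag = REVISIONS[element.tag]
        # Join the text of every matching descendant. Entries with empty
        # text can occur, e.g. for a tracked paragraph mark.
        text = "".join(node.text or "" for node in element.iter(text_tag))
        changes.append({
            "kind": kind,
            "author": element.get(W + "author"),
            "date": element.get(W + "date"),
            "text": text,
        })
    return changes
```

Calling tracked_changes("report.docx") (a hypothetical file name) returns one dictionary per w:ins or w:del element, in document order.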

answered Sep 22 '22 16:09 by imhotap