
Simplify / clean up the XML of a DOCX Word document


I have a Microsoft Word document (DOCX) and I use the Open XML SDK 2.0 Productivity Tool to generate C# code from it.

I want to programmatically insert database values into the document. To do this, I typed plain-text placeholders such as [[place holder 1]] at the points where my program should substitute the database values.

Unfortunately, the XML output is a mess. For example, I have a table with two neighboring cells that should differ only in their placeholder text, yet one of the placeholders is split across several runs.

[[good place holder]]

    <w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:tcPr>
        <w:tcW w:w="1798" w:type="dxa" />
        <w:shd w:val="clear" w:color="auto" w:fill="auto" />
      </w:tcPr>
      <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="0009453E">
        <w:pPr>
          <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
        </w:pPr>
        <w:r w:rsidRPr="0009453E">
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
          <w:t>[[good place holder]]</w:t>
        </w:r>
      </w:p>
    </w:tc>

versus [[bad place holder]]

    <w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:tcPr>
        <w:tcW w:w="1799" w:type="dxa" />
        <w:shd w:val="clear" w:color="auto" w:fill="auto" />
      </w:tcPr>
      <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="00EA211A">
        <w:pPr>
          <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
        </w:pPr>
        <w:r w:rsidRPr="00EA211A">
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
          <w:t>[[</w:t>
        </w:r>
        <w:proofErr w:type="spellStart" />
        <w:r w:rsidRPr="00EA211A">
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
          <w:t>bad</w:t>
        </w:r>
        <w:proofErr w:type="spellEnd" />
        <w:r w:rsidRPr="00EA211A">
          <w:rPr>
            <w:rFonts w:cstheme="minorHAnsi" />
            <w:sz w:val="20" />
            <w:szCs w:val="20" />
          </w:rPr>
          <w:t xml:space="preserve"> place holder]]</w:t>
        </w:r>
      </w:p>
    </w:tc>

Is there a way to have Microsoft Word clean up my document so that every placeholder is easy to identify in the generated XML?

K B asked Oct 13 '11 10:10



2 Answers

I have found a solution: the Open XML PowerTools Markup Simplifier.

I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but they didn't work exactly as written (perhaps because PowerTools is now at version 2.2). So I compiled PowerTools 2.2 in "Release" mode and added a reference to OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In Program.cs I changed only the path to my DOCX file. I ran the program once, and my document now seems fairly clean.

Code quoted from Eric's blog in the link above:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using OpenXmlPowerTools;
    using DocumentFormat.OpenXml.Packaging;

    class Program
    {
        static void Main(string[] args)
        {
            // Open the document for editing (second argument: isEditable = true).
            using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
            {
                // Choose which kinds of markup the simplifier should strip.
                SimplifyMarkupSettings settings = new SimplifyMarkupSettings
                {
                    RemoveComments = true,
                    RemoveContentControls = true,
                    RemoveEndAndFootNotes = true,
                    RemoveFieldCodes = false,
                    RemoveLastRenderedPageBreak = true,
                    RemovePermissions = true,
                    RemoveProof = true,      // drops the w:proofErr spelling markers
                    RemoveRsidInfo = true,   // drops the w:rsid* revision attributes
                    RemoveSmartTags = true,
                    RemoveSoftHyphens = true,
                    ReplaceTabsWithSpaces = true,
                };
                MarkupSimplifier.SimplifyMarkup(doc, settings);
            }
        }
    }
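To make it clearer why settings such as RemoveProof and RemoveRsidInfo help with split placeholders, here is a rough sketch of the underlying idea using only System.Xml.Linq: drop the w:proofErr markers and the rsid attributes, then fold neighbouring runs with identical w:rPr into one. The names RunMergeSketch and MergeRuns are made up for illustration, and this is only an assumption about the general approach, not the actual PowerTools implementation.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Illustrative sketch only; NOT how Open XML PowerTools is implemented.
public static class RunMergeSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Merges neighbouring <w:r> children of one <w:p> when their <w:rPr> match.
    public static void MergeRuns(XElement paragraph)
    {
        // 1. Drop spelling/grammar markers so the runs become direct neighbours.
        paragraph.Elements(W + "proofErr").Remove();

        // 2. Drop revision IDs (w:rsidR, w:rsidRPr, ...) from the paragraph and its content.
        foreach (var element in paragraph.DescendantsAndSelf())
        {
            element.Attributes()
                   .Where(a => a.Name.Namespace == W
                            && a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
                   .Remove();
        }

        // 3. Fold each run into its predecessor when both are text-only
        //    and carry the same formatting.
        XElement previous = null;
        foreach (var run in paragraph.Elements(W + "r").ToList())
        {
            if (previous != null && run.PreviousNode == previous && CanMerge(previous, run))
            {
                var target = previous.Element(W + "t");
                target.Value += run.Element(W + "t").Value;
                // Keep leading/trailing blanks in the merged text significant.
                target.SetAttributeValue(XNamespace.Xml + "space", "preserve");
                run.Remove();
            }
            else
            {
                previous = run;
            }
        }
    }

    // Two runs are mergeable if each holds only an optional <w:rPr> plus
    // exactly one <w:t>, and the two <w:rPr> elements are deeply equal.
    static bool CanMerge(XElement a, XElement b)
    {
        bool OnlyText(XElement r) =>
            r.Elements().All(e => e.Name == W + "rPr" || e.Name == W + "t")
            && r.Elements(W + "t").Count() == 1;

        return OnlyText(a) && OnlyText(b)
            && XNode.DeepEquals(a.Element(W + "rPr"), b.Element(W + "rPr"));
    }
}
```

Applied to the w:p element of the [[bad place holder]] cell from the question (loaded with XElement.Parse), this should leave a single run whose w:t reads [[bad place holder]]. A real document contains many more cases (fields, tabs, tracked changes), which is why I would rather rely on the MarkupSimplifier than on a hand-written merge.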
K B answered Oct 25 '22 05:10


You need to get rid of the Rsid information. According to this page, Rsid information "enables merging of two documents that have forked."

You need to install OpenXmlPowerTools in order to run the sample code below. The easiest way to do that is to run the following in the Package Manager Console:

Install-Package OpenXmlPowerTools 

Then you will be all set to run the following code. (This assumes that you have already added a "Test.docx" file to your project. If you are using Visual Studio, make sure that a copy of the file is in the Debug or Release folder, depending on your build mode.)

    // Sample code to remove Rsid information from a "Test.docx" document
    using DocumentFormat.OpenXml.Packaging;
    using OpenXmlPowerTools;

    using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
    {
        SimplifyMarkupSettings settings = new SimplifyMarkupSettings
        {
            RemoveRsidInfo = true
        };
        MarkupSimplifier.SimplifyMarkup(doc, settings);
    }

This will remove Rsid information that may get in the way in the process of manipulating Word files.
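Once every placeholder sits inside a single w:t element, filling in the database values becomes a plain string substitution. Below is a minimal sketch of that step, again using only System.Xml.Linq on the loaded document XML rather than the Open XML SDK classes. PlaceholderSketch and Fill are hypothetical names, and the [[key]] syntax is taken from the question. The method also reports keys it never found, which is a cheap way to notice placeholders that are still split across runs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the Open XML SDK or PowerTools.
public static class PlaceholderSketch
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces every [[key]] that occurs inside a single <w:t> element and
    // returns the keys that were never found (sorted), e.g. because the
    // placeholder is still split across several runs.
    public static List<string> Fill(XElement root, IDictionary<string, string> values)
    {
        var missing = new HashSet<string>(values.Keys);

        // ToList() so the tree is not modified while it is being enumerated.
        foreach (var text in root.Descendants(W + "t").ToList())
        {
            foreach (var pair in values)
            {
                string token = "[[" + pair.Key + "]]";
                if (!text.Value.Contains(token))
                    continue;

                text.Value = text.Value.Replace(token, pair.Value);
                missing.Remove(pair.Key);
            }
        }

        return missing.OrderBy(k => k, StringComparer.Ordinal).ToList();
    }
}
```

A non-empty result means the document probably still needs a clean-up pass before the values can be inserted reliably.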

Amadeus Sánchez answered Oct 25 '22 05:10