I have a Microsoft Word document (docx) and I use the Open XML SDK 2.0 Productivity Tool to generate C# code from it.
I want to programmatically insert some database values into the document. For this I typed simple text like [[place holder 1]] at the points where my program should replace the placeholders with database values.
Unfortunately, the XML output is rather messy. For example, I have a table with two neighboring cells that should not differ except for their placeholder text, but one of the placeholders is split into several runs.
[[good place holder]]
<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1798" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="0009453E">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="0009453E">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[good place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>
versus [[bad place holder]]
<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tcPr>
    <w:tcW w:w="1799" w:type="dxa" />
    <w:shd w:val="clear" w:color="auto" w:fill="auto" />
  </w:tcPr>
  <w:p w:rsidRPr="008C2E16" w:rsidR="001F54BF" w:rsidP="000D7B67" w:rsidRDefault="00EA211A">
    <w:pPr>
      <w:spacing w:after="0" w:line="240" w:lineRule="auto" />
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>[[</w:t>
    </w:r>
    <w:proofErr w:type="spellStart" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t>bad</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd" />
    <w:r w:rsidRPr="00EA211A">
      <w:rPr>
        <w:rFonts w:cstheme="minorHAnsi" />
        <w:sz w:val="20" />
        <w:szCs w:val="20" />
      </w:rPr>
      <w:t xml:space="preserve"> place holder]]</w:t>
    </w:r>
  </w:p>
</w:tc>
Is there any way to have Microsoft Word clean up my document so that all placeholders are easy to identify in the generated XML?
I have found a solution: the Open XML PowerTools Markup Simplifier.
I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but they didn't work 1:1 (maybe because PowerTools is now at version 2.2?). So I compiled PowerTools 2.2 in "Release" mode and added a reference to OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In Program.cs I only changed the path to my DOCX file. I ran the program once, and my document now seems to be fairly clean.
Code quoted from Eric's blog in the link above:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
        {
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
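To check whether a cleanup run like the one above actually left every placeholder in a single run, a small verification helper can be useful. The following is only a sketch under assumptions: `PlaceholderCheck` and `FindSplitPlaceholders` are hypothetical names that are not part of Open XML PowerTools or the Open XML SDK, and the helper works on the raw paragraph XML through System.Xml.Linq so that it has no external dependencies. It assumes placeholders are written as [[...]] as in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper (not part of PowerTools): lists the [[...]] placeholders
// in a paragraph that are still spread over more than one <w:t> element.
static class PlaceholderCheck
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static List<string> FindSplitPlaceholders(XElement paragraph)
    {
        // Remember where each <w:t> starts inside the concatenated text.
        var segments = new List<Tuple<int, int>>(); // (start, length)
        var text = new StringBuilder();
        foreach (XElement t in paragraph.Descendants(W + "t"))
        {
            segments.Add(Tuple.Create(text.Length, t.Value.Length));
            text.Append(t.Value);
        }

        var split = new List<string>();
        foreach (Match m in Regex.Matches(text.ToString(), @"\[\[.*?\]\]"))
        {
            // A placeholder is intact only if one <w:t> contains it completely.
            bool inOneTextElement = segments.Any(s =>
                m.Index >= s.Item1 && m.Index + m.Length <= s.Item1 + s.Item2);
            if (!inOneTextElement)
            {
                split.Add(m.Value);
            }
        }
        return split;
    }
}
```

Called on the w:p element of the "bad" cell from the question, this should report [[bad place holder]]; on the "good" cell it should return an empty list.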
You need to get rid of the Rsid information. According to this page, Rsid information "enables merging of two documents that have forked."
You need to install OpenXmlPowerTools in order to run the sample code below. The easiest way to do that is to run the following in the Package Manager Console:
Install-Package OpenXmlPowerTools
Then you will be all set to run the following code. (This assumes that you have already added a "Test.docx" file to your project. If you are using Visual Studio, make sure that a copy of the file is in either the Debug or the Release folder, depending on your build mode.)
// Sample code to remove Rsid information from a "Test.docx" document.
// Requires: using OpenXmlPowerTools; using DocumentFormat.OpenXml.Packaging;
using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        RemoveRsidInfo = true
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}
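This removes the Rsid information that can get in the way when manipulating Word files. Note that in the question's example the runs are separated by w:proofErr (proofing) markers rather than by differing rsid values, so setting RemoveProof = true is likely needed as well.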
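Once each placeholder sits in a single text element, the replacement step the question asks about becomes a plain string substitution. The following is a minimal sketch under assumptions: `PlaceholderFiller` and `Fill` are hypothetical names, the values dictionary stands in for whatever the database returns, and the code works on raw XML through System.Xml.Linq rather than the SDK's typed classes to stay dependency-free. It deliberately does not handle placeholders that are still split across runs.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: substitutes placeholder text inside <w:t> elements.
// It only finds placeholders that sit completely inside one <w:t>.
static class PlaceholderFiller
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns how many (text element, placeholder) pairs were replaced.
    public static int Fill(XElement root, IDictionary<string, string> values)
    {
        int replaced = 0;
        // ToList() takes a snapshot so the tree is not edited while enumerating.
        foreach (XElement t in root.Descendants(W + "t").ToList())
        {
            foreach (KeyValuePair<string, string> pair in values)
            {
                if (t.Value.Contains(pair.Key))
                {
                    t.Value = t.Value.Replace(pair.Key, pair.Value);
                    replaced++;
                }
            }
        }
        return replaced;
    }
}
```

A call might look like PlaceholderFiller.Fill(cell, new Dictionary<string, string> { { "[[good place holder]]", "42" } }), where cell is the parsed w:tc element. A return value of 0 for a placeholder you expected to find is a hint that it is still split across runs.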