OpenXML tag search

Tags:

I'm writing a .NET application that should read a .docx file nearby 200 pages long (trough DocumentFormat.OpenXML 2.5) to find all the occurences of certain tags that the document should contain. To be clear I'm not looking for OpenXML tags but rather tags that should be set into the document by the document writer as placeholder for values I need to fill up in a second stage. Such tags should be in the following format:

 <!TAG!>

(where TAG can be an arbitrary sequence of characters). As I said I have to find all the occurences of such tags plus (if possibile) locating the 'page' where the tag occurence have been found. I found something looking around in the web but more than once the base approach was to dump all the content of the file in a string and then look inside such string regardless of the .docx encoding. This either caused false positive or no match at all (while the test .docx file contains several tags), other examples were probably a little over my knowledge of OpenXML. The regex pattern to find such tags should be something of this kind:

<!(.)*?!>

The tag can be found all over the document (inside table, text, paragraph, as also header and footer).

I'm coding in Visual Studio 2013 .NET 4.5 but I can get back if needed. P.S. I would prefer code without use of Office Interop API since the destination platform will not run Office.

The smallest .docx example I can produce store this inside document

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00CA7780" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRPr="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY2</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00815E5D" w:rsidRPr="00815E5D">
  <w:pgSz w:w="11906" w:h="16838"/>
  <w:pgMar w:top="1417" w:right="1134" w:bottom="1134" w:left="1134" w:header="708" w:footer="708" w:gutter="0"/>
  <w:cols w:space="708"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

Best Regards, Mike

304

asked Feb 24 '15 13:02

weirdgyn

1 Answers

The problem with trying to find tags is that words are not always in the underlying XML in the format that they appear to be in Word. For example, in your sample XML the <!TAG1!> tag is split across multiple runs like this:

<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
    <w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
</w:r>

As pointed out in the comments this is sometimes caused by the spelling and grammar checker but that's not all that can cause it. Having different styles on parts of the tag could also cause it for example.

One way of handling this is to find the InnerText of a Paragraph and compare that against your Regex. The InnerText property will return the plain text of the paragraph without any formatting or other XML within the underlying document getting in the way.

Once you have your tags, replacing the text is the next problem. Due to the above reasons you can't just replace the InnerText with some new text as it wouldn't be clear as to which parts of the text would belong in which Run. The easiest way round this is to remove any existing Run's and add a new Run with a Text property containing the new text.

The following code shows finding the tags and replacing them immediately rather than using two passes as you suggest in your question. This was just to make the example simpler to be honest. It should show everything you need.

private static void ReplaceTags(string filename)
{
    Regex regex = new Regex("<!(.)*?!>", RegexOptions.Compiled);

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filename, true))
    {
        //grab the header parts and replace tags there
        foreach (HeaderPart headerPart in wordDocument.MainDocumentPart.HeaderParts)
        {
            ReplaceParagraphParts(headerPart.Header, regex);
        }
        //now do the document
        ReplaceParagraphParts(wordDocument.MainDocumentPart.Document, regex);
        //now replace the footer parts
        foreach (FooterPart footerPart in wordDocument.MainDocumentPart.FooterParts)
        {
            ReplaceParagraphParts(footerPart.Footer, regex);
        }
    }
}

private static void ReplaceParagraphParts(OpenXmlElement element, Regex regex)
{
    foreach (var paragraph in element.Descendants<Paragraph>())
    {
        Match match = regex.Match(paragraph.InnerText);
        if (match.Success)
        {
            //create a new run and set its value to the correct text
            //this must be done before the child runs are removed otherwise
            //paragraph.InnerText will be empty
            Run newRun = new Run();
            newRun.AppendChild(new Text(paragraph.InnerText.Replace(match.Value, "some new value")));
            //remove any child runs
            paragraph.RemoveAllChildren<Run>();
            //add the newly created run
            paragraph.AppendChild(newRun);
        }
    }
}

One downside with the above approach is that any styles you may have had will be lost. These could be copied from the existing Run's but if there are multiple Run's with differing properties you'll need to work out which ones you need to copy where. There's nothing to stop you creating multiple Run's in the above code each with different properties if that's what is required. Other elements would also be lost (e.g. any symbols) so those would need to be accounted for too.

106

answered Sep 22 '22 19:09

petelids

Related questions
                            
                                WebAPI OData Error The ObjectContent type failed to serialize the response body for content type 'application/json...'
                            
                                How to render derived types of a class differently?
                            
                                Windows store app ResourceLoader at design time
                            
                                Using SqlDependency with named Queues
                            
                                Difference between using Split with no parameters and RemoveEmptyEntries option
                            
                                Convert String^ in c# to CString in c++/CLI
                            
                                how to resolve The message filter indicated that the application is busy. (Exception from HRESULT: 0x8001010A (RPC_E_SERVERCALL_RETRYLATER))
                            
                                Stub vs Mock when unit testing [duplicate]
                            
                                CurrentCulture with async/await, Custom synchronization context
                            
                                Avoid using the JsonIgnore attribute in a domain model
                            
                                StackOverflowException when accessing member of generic type via dynamic: .NET/C# framework bug?
                            
                                Get all XAML files from compiled DLL
                            
                                How to run background service in ios forever for syncing of data
                            
                                Miniprofiler breaks on missing CreatedOn column
                            
                                Dapper: Help me run stored procedure with multiple user defined table types
                            
                                What's the default OAuth AccessTokenFormat implementation in OWIN for IIS host?
                            
                                Use connectionstring in WebJob on Azure
                            
                                When using NuGet Pack is it possible to specify the package name without a nuspec file?
                            
                                Is there a difference between DictionarySectionHandler and NameValueSectionHandler?
                            
                                Maintaining a single instance to a RestSharp client

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

OpenXML tag search

Tags:

c#

.net

ms-word

openxml

weirdgyn

People also ask

1 Answers

petelids

Recent Activity

Donate For Us