Comparing Docx files using OOXML

How can I read a docx file word by word, together with the style of each word? I want to compare two docx files word by word and, based on the differences, write the result into another docx file (using C# and OOXML). I have tried DocumentFormat.OpenXml.Extensions.dll, OpenXMLdiff.dll and ICSharpCode.SharpZipLib.dll, but none of them lets me read word by word with styles (ICSharpCode.SharpZipLib does give me the words, but not the style associated with each word).

Any help on this will be very useful.

user274223 asked Feb 16 '10 09:02





1 Answer

This MSDN article shows how to reliably retrieve the exact text of a document, paragraph by paragraph.

http://msdn.microsoft.com/en-us/library/ff686712.aspx

At the same time, you can determine the style for each paragraph. That is pretty easy. The following blog post shows how to retrieve the style and text for each paragraph:

http://blogs.msdn.com/b/ericwhite/archive/2009/02/16/finding-paragraphs-by-style-name-or-content-in-an-open-xml-word-processing-document.aspx
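
As a rough illustration of the idea behind these two links, the sketch below reads the style ID and text of each paragraph straight from word/document.xml inside the docx package. It is written in Python with only the standard library rather than C# and the Open XML SDK, so it is not the code from either article. The function name read_paragraphs and the fallback to "Normal" for paragraphs without an explicit style are assumptions made for the example (the real default style is defined in styles.xml). It also ignores tracked revisions, tabs and line breaks, so it will not give the exact text for every document.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx):
    """Return a list of (style_id, text) tuples, one per w:p element.

    `docx` may be a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style lives in w:pPr/w:pStyle/@w:val and is a style ID,
        # which is not always identical to the style name shown in Word.
        style = para.find("w:pPr/w:pStyle", NS)
        # Assumption: treat a missing w:pStyle as "Normal".
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else "Normal"
        # Concatenate the text of all w:t elements in the paragraph's runs.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs
```

Calling read_paragraphs("some.docx") (a placeholder file name) returns one tuple per paragraph. Run-level formatting, which a word-by-word comparison with styles would need, sits in the w:rPr child of each w:r element and could be collected in the same loop.
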

Comparing the two? It depends on your exact desired semantics. One approach would be to create, for each docx file, an XML document that contains its paragraphs and styles, and then compare the two XML documents. The XML document might look something like this:

<Root>
  <Para>
    <Style>Normal</Style>
    <Text>This is the text of the paragraph.</Text>
  </Para>
  <Para>
    <Style>Heading1</Style>
    <Text>Overview of the Process</Text>
  </Para>
</Root>
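
One possible way to build and compare such XML documents is sketched below, again in Python with the standard library rather than C#. The helper names to_xml, from_xml and diff_paragraphs are hypothetical, and using difflib.SequenceMatcher for the comparison is a choice made for this example, not something the approach prescribes. The input is assumed to be a list of (style, text) pairs per document.

```python
import difflib
import xml.etree.ElementTree as ET


def to_xml(paragraphs):
    """Build the <Root><Para><Style/><Text/></Para>...</Root> tree shown above."""
    root = ET.Element("Root")
    for style, text in paragraphs:
        para = ET.SubElement(root, "Para")
        ET.SubElement(para, "Style").text = style
        ET.SubElement(para, "Text").text = text
    return root


def from_xml(root):
    """Read the (style, text) pairs back out of a Root element."""
    return [
        (para.findtext("Style", ""), para.findtext("Text", ""))
        for para in root.findall("Para")
    ]


def diff_paragraphs(old_root, new_root):
    """Return (operation, old_paragraphs, new_paragraphs) for each changed span.

    A paragraph counts as changed if either its style or its text differs.
    """
    old, new = from_xml(old_root), from_xml(new_root)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old[i1:i2], new[j1:j2]))
    return changes
```

The operations reported are "replace", "delete" and "insert". For a word-level comparison, the same matcher could be applied to the words of each Text element (for example after splitting on whitespace) instead of to whole paragraphs.
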
Eric White answered Oct 01 '22 16:10
