How can I debug a corrupt docx file?

I have an issue where .doc and .pdf files are coming out OK but a .docx file is coming out corrupt.

In order to solve that I am trying to debug why the .docx is corrupt.

I learned that the docx format is much stricter with regard to extra characters than either .pdf or .doc. Therefore I have searched the various xml files WITHIN the docx file looking for invalid XML. But I can't find any. It all validates fine.
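The check described above can be automated. This is a minimal sketch, assuming .NET's `System.IO.Compression` and `System.Xml`; the class and method names are made up for illustration. It only tests that each XML part is well-formed, not that it is valid against the Open XML schemas.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Xml;

public static class DocxXmlCheck
{
    // Returns one message per .xml/.rels entry that fails to parse as well-formed XML.
    // Hypothetical helper name; not part of any SDK.
    public static List<string> FindMalformedParts(string docxPath)
    {
        var problems = new List<string>();
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in zip.Entries)
            {
                bool isXml =
                    entry.FullName.EndsWith(".xml", StringComparison.OrdinalIgnoreCase) ||
                    entry.FullName.EndsWith(".rels", StringComparison.OrdinalIgnoreCase);
                if (!isXml) continue; // skip images and other binary parts

                try
                {
                    using (Stream stream = entry.Open())
                    using (XmlReader reader = XmlReader.Create(stream))
                    {
                        // Reading to the end forces a full well-formedness check.
                        while (reader.Read()) { }
                    }
                }
                catch (Exception ex) when (ex is XmlException || ex is InvalidDataException)
                {
                    problems.Add(entry.FullName + ": " + ex.Message);
                }
            }
        }
        return problems;
    }
}
```

If `ZipFile.OpenRead` itself throws, the archive container is unreadable, which points away from the XML and towards the bytes of the file as a whole.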

[Image: the XML files I've been checking]

Could anyone suggest directions for me to investigate now?

UPDATE:

The full listing of files inside the folder is as follows:

/_rels
    .rels

/customXml
    /_rels
        .rels
    item1.xml
    itemProps1.xml

/docProps
    app.xml
    core.xml

/word
    /_rels
        document.xml.rels
    /media
        image1.jpeg
    /theme
        theme1.xml
    document.xml
    fontTable.xml
    numbering.xml
    settings.xml
    styles.xml
    stylesWithEffects.xml
    webSettings.xml

[Content_Types].xml

UPDATE 2:

I should also have mentioned that the reason for the corruption is almost certainly a bad binary file POST on my part.

Why are .docx files corrupted by a binary POST, when .doc and .pdf files are fine?
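One way to confirm that the POST is altering the bytes is to compare the file as it was before upload with the copy that arrived. This is a rough sketch, assuming both copies are available on disk and small enough to load into memory; `ByteCompare` and `FirstDifference` are illustrative names only.

```csharp
using System;
using System.IO;

public static class ByteCompare
{
    // Returns the offset of the first differing byte, or -1 if the files are identical.
    // If one file is a prefix of the other, returns the length of the shorter file.
    public static long FirstDifference(string pathA, string pathB)
    {
        byte[] a = File.ReadAllBytes(pathA);
        byte[] b = File.ReadAllBytes(pathB);

        int common = Math.Min(a.Length, b.Length);
        for (int i = 0; i < common; i++)
        {
            if (a[i] != b[i]) return i;
        }
        return a.Length == b.Length ? -1 : common;
    }
}
```

A result other than -1 gives an offset to inspect in a hex editor. Differing file lengths may point to truncation or to the body being treated as text somewhere along the way.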

UPDATE 3:

I have tried the demo versions of various docx repair tools. They all seem to repair the file OK but give no clue as to the cause of the error.

My next step is to compare the contents of the corrupted file with the repaired version.

If anybody knows of a docx repair tool that gives a decent error message I'd appreciate hearing about it. In fact I might post that as a separate question.

UPDATE 4 (2017)

I never solved this problem. I have tried all the tools suggested in the answers below but none of them worked for me.

I have since progressed a little further and found a block of 0000 missing when opening the .docx in Sublime Text. More details in the new question here: What could be causing this corruption in .docx files during httpwebrequest?

Martin Hansen Lennox asked Aug 12 '13 18:08




3 Answers

I used the "Open XML SDK 2.5 Productivity Tool" (http://www.microsoft.com/en-us/download/details.aspx?id=30425) to find a problem with a broken hyperlink reference.

You have to download/install the SDK first, then the tool. The tool will open and analyze the document for problems.

Jeremy K answered Sep 19 '22 13:09


Usually, when there is an error in a particular XML file, Word tells you the file and the line where the error occurs. So I believe the problem comes from either the zipping of the file or the folder structure.

The .docx format is a zipped file. Here is the folder structure of a typical Word file:

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //contains the elements in the footer of the document
|  +  footnotes.xml
|  +--media //this folder contains all images embedded in the document
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this file tells Word where the images are located
+  [Content_Types].xml
\--_rels
   \  .rels

It looks as though you only have the contents of the word folder, is that right? If this doesn't help, could you either send the corrupted docx or post the folder structure inside your zip?
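The folder structure can also be checked programmatically. This is a minimal sketch, assuming .NET's `System.IO.Compression`; the three part names are taken from the listing above as the conventional ones, and `DocxLayoutCheck` and `MissingParts` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

public static class DocxLayoutCheck
{
    // Conventional entry names from the listing above. The comparison is exact,
    // so an extra top-level folder or backslash separators will count as missing.
    static readonly string[] ExpectedParts =
    {
        "[Content_Types].xml",
        "_rels/.rels",
        "word/document.xml"
    };

    // Returns the expected entries that are absent from the archive.
    public static List<string> MissingParts(string docxPath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(docxPath))
        {
            var names = new HashSet<string>(
                zip.Entries.Select(e => e.FullName), StringComparer.Ordinal);
            return ExpectedParts.Where(p => !names.Contains(p)).ToList();
        }
    }
}
```

If all three are reported missing, it is worth printing the entry names: a common cause is zipping the containing folder rather than its contents, which puts every part one level too deep.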

edi9999 answered Sep 17 '22 13:09


Many years late, but I found this which actually worked for me. (From https://msdn.microsoft.com/en-us/library/office/bb497334.aspx)

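(wordDoc is a WordprocessingDocument; filePath is a placeholder for the path to your .docx)

    using System;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Validation;

    try
    {
        // Open the package read-only. If the file cannot be read as a package
        // at all, Open throws and the catch block below reports it.
        using (var wordDoc = WordprocessingDocument.Open(filePath, false))
        {
            var validator = new OpenXmlValidator();
            var count = 0;

            // Each ValidationErrorInfo describes one schema or semantic violation.
            foreach (var error in validator.Validate(wordDoc))
            {
                count++;
                Console.WriteLine("Error " + count);
                Console.WriteLine("Description: " + error.Description);
                Console.WriteLine("ErrorType: " + error.ErrorType);
                Console.WriteLine("Node: " + error.Node);
                // Path and Part are guarded against null before dereferencing.
                Console.WriteLine("Path: " + error.Path?.XPath);
                Console.WriteLine("Part: " + error.Part?.Uri);
                Console.WriteLine("-------------------------------------------");
            }

            Console.WriteLine("count={0}", count);
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }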
Blue Eyed Behemoth answered Sep 20 '22 13:09