
How can I query a Word docx in an ASP.NET app?

I would like to upload a .docx file (Word 2007 or later) to my web server and convert its table of contents to a simple XML structure. On the desktop this would have been easy with traditional VBA, but the WordprocessingML XML that makes up a .docx file is confusing to read directly. Is there a way (without COM) to navigate the document in a more object-oriented fashion?

gidmanma asked Aug 18 '09 21:08




2 Answers

I highly recommend looking into the Open XML SDK 2.0. It's a CTP, but I've found it extremely useful for manipulating Open XML files such as .docx without having to deal with COM at all. The documentation is a bit sketchy, but the key class to look for is DocumentFormat.OpenXml.Packaging.WordprocessingDocument.

You can also pick apart a .docx file by renaming its extension to .zip and digging into the XML files inside. From doing that, it looks like the table of contents is contained in a structured document tag (SdtBlock in the SDK) and that each heading sits inside a hyperlink within it. After putzing around with it a bit, I found that something like this should work, or at least give you a starting point:

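using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

List<string> contentList = new List<string>();
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(Filename, false))
{
    // Assumes the first structured document tag in the body is the table of contents.
    SdtBlock contents = wordDoc.MainDocumentPart.Document.Descendants<SdtBlock>().First();
    foreach (Hyperlink section in contents.Descendants<Hyperlink>())
    {
        contentList.Add(section.Descendants<Text>().First().Text);
    }
}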
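
Since the question asks for a simple XML structure, the collected headings could then be wrapped with LINQ to XML. This is only a minimal sketch: the class name TocXml and the element names "toc" and "entry" are made up for illustration, and the result is a flat list with no nesting by heading level.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TocXml
{
    // Wraps each heading string in an <entry> element under a <toc> root.
    // "toc" and "entry" are illustrative names, not a standard schema.
    public static XElement Build(IEnumerable<string> headings)
    {
        return new XElement("toc",
            headings.Select(h => new XElement("entry", h)));
    }
}
```

Calling TocXml.Build(contentList).ToString() returns the markup as a string, and XElement escapes characters such as & and < in the heading text for you.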
Jacob Proffitt answered Sep 20 '22 06:09


There is a blog post (linked below) on querying Open XML WordprocessingML documents using LINQ to XML. Using the code from that post, you can write a query as follows:

using (WordprocessingDocument doc =
    WordprocessingDocument.Open(filename, false))
{
    foreach (var p in doc.MainDocumentPart.Paragraphs())
    {
        Console.WriteLine("Style: {0}   Text: >{1}<",
            p.StyleName.PadRight(16), p.Text);
        foreach (var c in p.Comments())
            Console.WriteLine(
              "  Comment Author:{0}  Text:>{1}<",
              c.Author, c.Text);
    }
}
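
The query above relies on the code from the blog post for Paragraphs(), StyleName, and Comments(). As a rough sketch of the same idea using only the .NET base class library, the following reads the main document XML straight out of the .docx package (which is a zip archive) and queries it with LINQ to XML. It is not the blog post's implementation. The names DocxParagraphs, Query, and LoadMainPart are hypothetical, it assumes the main part is stored at word/document.xml (the usual location, although the package relationships formally decide this), and it returns the style ID from w:pStyle rather than a display name.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxParagraphs
{
    // Namespace of the main WordprocessingML vocabulary (the "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns (style ID, text) for every w:p element, including paragraphs
    // nested in tables. Paragraphs without a w:pStyle get an empty style ID.
    public static IEnumerable<(string StyleId, string Text)> Query(XDocument documentXml)
    {
        return documentXml.Descendants(W + "p").Select(p => (
            StyleId: (string)p.Elements(W + "pPr")
                              .Elements(W + "pStyle")
                              .Attributes(W + "val")
                              .FirstOrDefault() ?? "",
            // Text is split across runs, so join every w:t under the paragraph.
            Text: string.Concat(p.Descendants(W + "t").Select(t => t.Value))));
    }

    // Loads word/document.xml from the .docx zip package (assumed location).
    public static XDocument LoadMainPart(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found in package");
            }
            using (Stream stream = entry.Open())
            {
                return XDocument.Load(stream);
            }
        }
    }
}
```

A call such as DocxParagraphs.Query(DocxParagraphs.LoadMainPart(filename)) then yields one tuple per paragraph. On .NET Framework, ZipFile needs a reference to System.IO.Compression.FileSystem.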

Blog post: Open XML SDK and LINQ to XML

-Eric

Eric White answered Sep 19 '22 06:09