
Trying to read an MS Office Document

I have a chunk of code that I'm using to read MS Office Word documents.

static void ReadMSOfficeWordFile(string file) {
    try {
        Microsoft.Office.Interop.Word.Application msWordApp = new Microsoft.Office.Interop.Word.Application();
        object nullobj = System.Reflection.Missing.Value;
        object ofalse = false;
        object ofile = file;

        Microsoft.Office.Interop.Word.Document doc = msWordApp.Documents.Open(
                                                    ref ofile, ref nullobj, ref nullobj,
                                                    ref nullobj, ref nullobj, ref nullobj,
                                                    ref nullobj, ref nullobj, ref nullobj,
                                                    ref nullobj, ref nullobj, ref nullobj,
                                                    ref nullobj, ref nullobj, ref nullobj,
                                                    ref nullobj);
        string result = doc.Content.Text.Trim();
        doc.Close(ref ofalse, ref nullobj, ref nullobj);
        msWordApp.Quit();
        CheckLineMatch(file, result);
    }
    catch {
        RaiseError("Unable to parse file because of MS Office error.", file);
    }
}

I have three issues with this.

First, it relies on MS Office being installed on every system this might run on. Some people prefer LibreOffice, but this still needs to read MS Office Word documents.

Second, I don't know whether this will even work for both MS Office 2003 and MS Office 2007 documents.

Third, it's SLOW. It's excruciatingly slow.

So I assume there must be a better way to do this than what I, a novice, have come up with. I'm only trying to read the text in the document, nothing else.

MTeck asked Nov 26 '25 10:11


2 Answers

We are able to achieve a lot with NPOI, an open source project, without any dependency on Office.

For example, reading all the text from a Word document can be implemented as shown below.

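// Requires System.IO and System.Text, plus the NPOI namespaces
// that provide HWPFDocument and WordExtractor.
public string ReadAllTextFromWordDocFile(string fileName)
{
    // Open the file as a raw byte stream; a StreamReader is meant for
    // text files and is not needed for a binary Word document.
    using (FileStream stream = File.OpenRead(fileName))
    {
        var document = new HWPFDocument(stream);
        var wordExtractor = new WordExtractor(document);
        var docText = new StringBuilder();

        // Append each paragraph on its own line, trimmed of surrounding whitespace.
        foreach (string text in wordExtractor.ParagraphText)
        {
            docText.AppendLine(text.Trim());
        }

        // The using block closes the stream, so no explicit Close() is needed.
        return docText.ToString();
    }
}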
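
HWPFDocument targets the binary .doc format, so Word 2007 .docx files need a different reader. A .docx file is a ZIP archive whose body text normally lives in word/document.xml, so for plain text extraction a rough sketch using only the .NET base class library could look like the following. The class name DocxTextReader and its method are hypothetical, and the code assumes the main part sits at word/document.xml and that only paragraph text matters (tabs, line breaks, headers, footers and text boxes are not handled).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of NPOI or any other library.
public static class DocxTextReader
{
    // WordprocessingML namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string ReadAllText(Stream docxStream)
    {
        // A .docx file is a ZIP container; leave the caller's stream open.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main document part is at its conventional location.
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found; not a .docx file?");
            }

            XDocument xml;
            using (Stream part = entry.Open())
            {
                xml = XDocument.Load(part);
            }

            var text = new StringBuilder();
            foreach (XElement paragraph in xml.Descendants(W + "p"))
            {
                // Join the text runs (w:t) of the paragraph into one line.
                text.AppendLine(string.Concat(
                    paragraph.Descendants(W + "t").Select(t => t.Value)));
            }
            return text.ToString();
        }
    }
}
```

It could be called with `using (var fs = File.OpenRead(path)) { string text = DocxTextReader.ReadAllText(fs); }`. This avoids both Office and a third-party dependency for .docx, at the cost of ignoring everything except paragraph text.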
Riju answered Nov 29 '25 00:11


In response to your "Word application hanging open", you need to tell it to close.

msWordApp.Quit()

See http://msdn.microsoft.com/en-us/library/bb215475(v=office.12).aspx

Regarding the "relies on MS Office being installed" point: you are using the interop, so by definition it requires Office to be installed. You can look into one of the commercial libraries.

http://www.aspose.com/categories/.net-components/aspose.words-for-.net/default.aspx
http://www.gemboxsoftware.com/document/pricelist

Babak Naffas answered Nov 28 '25 22:11