What is the best way to parse Microsoft Office and PDF documents?

Question

I'm developing a desktop search engine using VB9 (VS2008) and Lucene.NET. The Lucene.NET indexer accepts only raw text, and raw text cannot be read directly from Microsoft Office (DOC, DOCX, PPT, PPTX) or PDF documents. What is the best way to extract raw text from such files?

David Tischler · Accepted Answer

You can use components that implement the IFilter interface, as Windows Desktop Search does.

Example of its usage from .NET
Links to IFilter implementations
Description of the IFilter interface

Dirk Vollmar · Answer

I can only talk about MS Office documents here. There are several ways to do this:

Using COM automation
Using converters which output the document in a more accessible format
Using 3rd-party libraries
Using Microsoft's OpenXML SDK

COM automation is not always reliable, mainly because the automated Office application tends to hang when it shows a modal popup dialog.

Converters are available for Word. Microsoft's Text Converter SDK lets you use the document converters that ship with Word in a stand-alone application. It requires some C coding, but because you are using the same conversion engines as Office, you get high-fidelity results. The SDK can be obtained from http://support.microsoft.com/kb/111716.

For the third option, third-party libraries, have a look at Apache POI or the b2xtranslator project on SourceForge. The latter provides a C# library that can extract the text from binary Word documents. Its PowerPoint support is still at an early stage, but text extraction should already work.

The last option is Microsoft's OpenXML SDK. This might be the preferred and easiest way, and samples are easy to find online. You can also handle binary documents by first converting them with the Office Compatibility Pack (download and install from Microsoft):

Word:

"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>

Excel:

"C:\Program Files\Microsoft Office\Office12\excelcnv.exe" -oice <input file> <output file>

PowerPoint:

"C:\Program Files\Microsoft Office\Office12\ppcnvcom.exe" -oice <input file> <output file>
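
Once a document is in the OpenXML format (natively, or after running one of the converters above), it is a ZIP package containing XML parts, so the text can be pulled out even without the SDK. The following is only a minimal sketch of that idea, not the SDK's own API. It assumes .NET 4.5 or later (System.IO.Compression.ZipArchive is not available in the VS2008 toolchain from the question), assumes the main part sits at the conventional path word/document.xml, and ignores headers, footers, footnotes and comments, which live in separate parts. The class name DocxText is made up for the example.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the OpenXML SDK.
public static class DocxText
{
    // WordprocessingML main namespace used by the elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the body text of a .docx stream, one line per paragraph.
    public static string Extract(Stream docx)
    {
        // leaveOpen: true, so the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            // Assumption: the main document part is at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);

                // Each w:p is a paragraph; its visible text is split across w:t runs.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

A caller would open the file with File.OpenRead, pass the stream to DocxText.Extract, and hand the returned string to the Lucene.NET indexer. For anything beyond plain body text, the OpenXML SDK is likely the more robust choice.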

plinth · Answer

For PDF you can use my company's .NET PDF Reader component that features text extraction.

This is exactly the code you write to extract the text from a PDF:

public String ReadTextFromPages(Stream s)
{
    using (PdfTextDocument doc = new PdfTextDocument(s))
    {
        PdfTextReader rdr = doc.GetPdfTextReader();
        return rdr.ReadToEnd();
    }
}

Tags: parsing, vb.net, pdf, ms-office, lucene.net

user57175
