
How to index Word 2003, 2007 and 2010 documents using Lucene.NET

I am writing a custom Lucene.NET indexer to enable indexing of MS Word documents. The indexer must be capable of handling the last three releases of MS Word: 2010, 2007 and 2003.

The plan is to use VSTO interop assemblies that are installed as part of VS2010 to extract text content from the documents.

Is there a better way to implement Word document indexing? Does this mean I will have to install all three versions of Word on the server? Or just Word 2010?

Tools/Environment:

  • Lucene.NET 2.3.1.3
  • VS2010 / .NET 3.5
  • Windows 2008 / IIS 7

Note: For details on how to implement this, see Sitecore text search in PDF or Word documents

Arnold Zokas asked Oct 25 '10 12:10


1 Answer

You could use IFilter plugins to retrieve the contents of the documents and then index them. The interface was originally part of Microsoft Indexing Service but is generally available for indexing documents.

I looked into the technology a couple of years ago and seem to remember that the filters for Office documents were either built into Windows or could be installed separately from the complete Office package, but I may be wrong here.

You can read more about the IFilter technology at IFilter at Wikipedia and IFilter at MSDN. You will have to look into P/Invoke and might get some inspiration from IFilter at pinvoke.net.

A sample in C# can be found at MSDN Code Gallery.
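To illustrate the "retrieve the contents, then index them" flow, here is a minimal sketch of the Lucene.NET side only. It assumes the Lucene.NET 2.3-era API (`Field.Index.TOKENIZED`, `IndexWriter(string, Analyzer, bool)`). The class `WordIndexer`, the method `IndexFolder`, the field names `path` and `content`, and the `extractText` callback are all hypothetical names made up for this example. The callback stands in for whatever IFilter wrapper you build via P/Invoke, which is not shown here.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class WordIndexer
{
    // extractText is a hypothetical callback (assumption): it represents
    // your IFilter wrapper and returns the plain text of the given file.
    public static void IndexFolder(string docFolder, string indexPath,
                                   Func<string, string> extractText)
    {
        // The third argument 'true' creates a new index, replacing any existing one.
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        try
        {
            // "*.doc*" is meant to pick up both .doc (2003) and .docx (2007/2010).
            foreach (string file in System.IO.Directory.GetFiles(docFolder, "*.doc*"))
            {
                string text = extractText(file);

                Document doc = new Document();
                // Store the path untokenized so a search hit can be mapped back to its file.
                doc.Add(new Field("path", file, Field.Store.YES, Field.Index.UN_TOKENIZED));
                // Tokenize the body for full-text search, but do not store it in the index.
                doc.Add(new Field("content", text, Field.Store.NO, Field.Index.TOKENIZED));
                writer.AddDocument(doc);
            }
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Keeping text extraction behind a callback means the indexing code does not care whether the text comes from an IFilter, from Office interop, or from some other extractor, so you can swap the extraction approach without touching the index logic.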

HakonB answered Nov 10 '22 11:11