
How to extract text from MS office documents in C#

I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel, and PowerPoint files using C#. Where can I find a free and simple .NET library for reading MS Office documents? I tried NPOI, but I could not find a sample showing how to use it.

Elias Haileselassie asked Jun 18 '09 07:06




2 Answers

For Microsoft Word 2007 and Microsoft Word 2010 (.docx) files, you can use the Open XML SDK. The snippet below opens a document and returns its contents as text. It is especially useful if you want to parse the contents of a Word document with regular expressions. To use this solution, you need to reference DocumentFormat.OpenXml.dll, which is part of the Open XML SDK.

See: http://msdn.microsoft.com/en-us/library/bb448854.aspx

    using System;
    using System.Text;
    using System.Xml;
    using DocumentFormat.OpenXml.Packaging;
    using Microsoft.SharePoint; // SPFile comes from the SharePoint object model

    public static string TextFromWord(SPFile file)
    {
        const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        StringBuilder textBuilder = new StringBuilder();
        using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file.OpenBinaryStream(), false))
        {
            // Manage namespaces to perform XPath queries.
            NameTable nt = new NameTable();
            XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
            nsManager.AddNamespace("w", wordmlNamespace);

            // Get the document part from the package.
            // Load the XML in the document part into an XmlDocument instance.
            XmlDocument xdoc = new XmlDocument(nt);
            xdoc.Load(wdDoc.MainDocumentPart.GetStream());

            // Each w:p element is a paragraph; its w:t descendants hold the text runs.
            XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
            foreach (XmlNode paragraphNode in paragraphNodes)
            {
                XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
                foreach (XmlNode textNode in textNodes)
                {
                    textBuilder.Append(textNode.InnerText);
                }
                // End every paragraph with a line break.
                textBuilder.Append(Environment.NewLine);
            }
        }
        return textBuilder.ToString();
    }
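
If you are not working with SharePoint or cannot reference the SDK, the same idea can be sketched with only the base class library, since a .docx file is a ZIP package. This is a minimal sketch, not a definitive implementation. It assumes .NET 4.5 or later (for System.IO.Compression) and that the main document part sits at the conventional path word/document.xml, which the SDK would normally resolve for you. The class name DocxText and the method name Extract are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml.Linq;

public static class DocxText // hypothetical helper name
{
    // WordprocessingML main namespace (same URI as in the snippet above).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Reads a .docx stream and returns one line of text per w:p paragraph.
    public static string Extract(Stream docx)
    {
        var text = new StringBuilder();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part lives at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream part = entry.Open())
            {
                // Preserve whitespace so runs that contain only spaces survive.
                XDocument xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);
                foreach (XElement paragraph in xml.Descendants(W + "p"))
                {
                    // Concatenate every text run inside the paragraph.
                    foreach (XElement run in paragraph.Descendants(W + "t"))
                        text.Append(run.Value);
                    text.Append(Environment.NewLine);
                }
            }
        }
        return text.ToString();
    }
}
```

You would call it with any readable stream, for example a FileStream from File.OpenRead. Like the XPath version, it only reads the main document body, so headers, footers, and footnotes are not included.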
KyleM answered Sep 23 '22 02:09


Using P/Invoke, you can call the IFilter interface (on Windows). IFilters for many common file types are installed with Windows (you can browse them using this tool). You simply ask the IFilter to return the text from the file. There are several sets of example code (here is one such example).

adrianbanks answered Sep 22 '22 02:09