I need a .NET library that I can use to extract text data from PDF, Excel and Word files. Ideally a free tool! Would you recommend any? Many thanks.


How to extract text from Pdf, Word and Excel documents? [closed]

5 Answers

As someone who has spent many days looking for free solutions for (nearly) this exact problem, I can tell you honestly that you will not find a free library that extracts text from all of those formats well. The only library I'm aware of that does a great job with all of those formats (and more) is commercial, and it's not actually native to .NET: it's a C++/COM library with a C++/CLI .NET wrapper.

What are some options?

iTextSharp -- This one is absolutely fantastic at extracting text from PDFs. While earlier versions of this library were commercial-friendly (LGPL), the authors have since decided to charge for the software and now release it under the AGPL, so unless you want to release all of your source code, you probably don't want to use one of the newer versions. However, the last version licensed under the LGPL (4.1.6) can be found all over the internet. This SO question has a link to a version that is under the LGPL.
PdfBox -- Another PDF library. This one, IMO, is better because it's under the Apache 2.0 license. It has a few issues: it sometimes (perhaps rarely) does not do as good a job as iTextSharp. I attribute this mostly to the fact that it's a newer library. However, my experience with it is from months ago. The project is actively developed, and 52 issues were resolved in the last month alone. I would keep my eye on this one. Please note this is a Java library. (Keep reading below for why I've included it.)
POI or NPOI -- These are libraries written specifically for Microsoft Office documents, particularly the pre-2007 OLE binary file formats. They do support the newer OpenXML formats, though I'm not sure how mature that part is. POI is the Java version (again, keep reading below for why I've included it), whereas NPOI is a native .NET version. However, NPOI only supports Excel documents, whereas POI can extract text from many more types.
Open XML SDK 2.0 -- A library for reading/modifying Office 2007+ (unencrypted OpenXML) documents, created by Microsoft themselves! This is an amazing library for working with these kinds of documents. However, it is a lower-level library and therefore doesn't (as far as I know) have a do-everything text extraction class. There's a fairly good example of text extraction from a Word document at this SO answer, though I'm not sure it covers certain cases like text in tables.
Tika -- Once again, a Java library (I'm not telling you about Java libraries for no reason. Keep on reading! :)), and this is as close to "one library" for text extraction as you can get. Tika can extract metadata and structured text content from many different kinds of files, using existing parsing libraries. It actually uses POI and PdfBox under the hood for Office and PDF documents.
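To make the "lower-level" point about the OpenXML formats concrete: a .docx file is a ZIP archive whose main body lives in word/document.xml, so a rough text dump is possible with only the .NET base class library. The following is a minimal sketch, not the Open XML SDK's API and not the code from the linked SO answer. The class name DocxText and the method Extract are made up for illustration. It assumes .NET 4.5 or later (with a reference to System.IO.Compression.FileSystem on .NET Framework), reads only the main document body (no headers, footers, footnotes or comments), ignores tabs and line breaks inside paragraphs, and may repeat text that sits in text boxes nested inside a paragraph.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class DocxText
{
    // WordprocessingML main namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of the main document body, one line per paragraph.
    public static string Extract(string path)
    {
        using (ZipArchive archive = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException(
                    "word/document.xml not found; is this a .docx file?");
            }

            using (Stream stream = entry.Open())
            {
                XDocument doc = XDocument.Load(stream);
                var sb = new StringBuilder();

                // Every w:p element is a paragraph (including those in table cells).
                foreach (XElement paragraph in doc.Descendants(W + "p"))
                {
                    // Join the text runs (w:t) that belong to this paragraph.
                    string line = string.Concat(
                        paragraph.Descendants(W + "t").Select(t => t.Value));
                    sb.AppendLine(line);
                }

                return sb.ToString();
            }
        }
    }
}
```

Calling DocxText.Extract("report.docx") (the file name is just an example) returns the body text as a single string. Because table cells contain ordinary paragraphs, their text is included, but cell and row boundaries are lost.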

Commercial

dtSearch -- This is a library I'm very familiar with. It does a fantastic job and can parse a ridiculous number of file formats. However, it costs money and is probably overkill for what you need. It's actually exactly what we need, but we're trying to get rid of it ourselves, because we only use it for parsing (it's really a full-text search engine), and there are plenty of parsing libraries out there that we can use or modify to suit our needs. Still, it honestly blows all these other libraries out of the water. As I mentioned before, it is also not native .NET code. A C++/CLI wrapper is used to interop between the DLL and the .NET runtime.

iFilters can be used, and are mentioned in several other SO answers on different questions, but the text you will get back is unstructured. Sometimes it's just bad...unreadable for humans, at least. I believe that iFilters are also deprecated, and depending on license issues, you might not be able to redistribute them.

Why did I mention all of those Java libraries? For two reasons. First, there are no free .NET equivalents that come close to the quality of these Java libraries. Second, you can use these libraries in .NET using IKVM (I've done this myself with these libraries, so I can at least vouch for that). IKVM is an implementation of Java that runs on .NET. Here is a good example of using IKVM to convert Tika into a .NET assembly that can be used in your project. Perhaps the scariest thing about IKVM is that it just works!

EDIT: I forgot that the author of that blog had actually posted the code and converted libraries in a GitHub project. So, if you want to quickly check it out, you can do so there. However, it uses a version of Tika that is over a year old. If the results aren't what you expected, I would suggest doing the conversion yourself with the latest version.

121

answered Sep 28 '22 14:09

Christopher Currens

You can take a look at toxy.codeplex.com. Toxy is a pure .NET text extraction framework.

It's very simple to use Toxy. For example, to extract the contents of an Excel spreadsheet file called test.xlsx:

ParserContext context = new ParserContext("test.xlsx");
ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
ToxySpreadsheet ss = parser.Parse();
// Then you can start handling the result, a ToxySpreadsheet object.

answered Sep 28 '22 14:09

Tony Qu

Here's a link on extracting text from a Word document:

How to extract text from MS office documents in C#

For PDFs I would use PDFsharp. It is open source, and its website has some good examples:

http://pdfsharp.com/PDFsharp/

answered Sep 28 '22 14:09

NKamrath

For extracting text from PDFs, iTextSharp is awesome. It is free and open source.

Reading text from a PDF is very easy using this library.
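As a rough illustration of what that can look like, here is a minimal sketch that assumes the iTextSharp 5.x API (the iTextSharp.text.pdf.parser namespace); the older LGPL 4.1.6 release may not expose the same classes, and 5.x is AGPL-licensed. The class name PdfText and the method Extract are made up for this example.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, for illustration only; assumes iTextSharp 5.x is referenced.
public static class PdfText
{
    // Returns the text of every page, in page order.
    public static string Extract(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            var sb = new StringBuilder();

            // Page numbers in iTextSharp are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }

            return sb.ToString();
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
    }
}
```

The result is plain text with no layout information, and scanned PDFs that contain only images will come back empty unless OCR is applied first.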

answered Sep 28 '22 14:09

Md Kamruzzaman Sarker

I would recommend Aspose Total for this. A few years ago I did a project doing pretty much exactly what you are asking, and compared to using Office Interop across different versions of Office (prior to the change to XML), Aspose was the most robust library. You will probably have to do some OCR too, based on what you are describing. It's not cheap, but I found their APIs pretty solid, and it works on most versions of the file types you are asking about. You should be able to use the free trial to see if it will fit your project. I have no affiliation with Aspose other than having used their tools in a production environment.

Aspose Total

answered Sep 28 '22 13:09

ElvisLives

Tags: html, c#, .net, pdf, extract

The Light
