Extract text from PDF and Word files

2 Answers

You can use the IFilter text filters that the Windows indexing service relies on. They are designed to extract the plain text from a document so that its contents can be searched. They work for Office files, PDFs, HTML and so on: basically any file type that has a filter installed. The only downside is that the filters must be installed on the server, so this may not be possible if you don't have direct access to it. Some filters come pre-installed with Windows; others, such as the PDF filter, you have to install yourself. For a C# implementation, check out this article: Using IFilter in C#

answered Sep 27 '22 17:09

pbz

PDF:

You have various options.

pdftotext:
Download the XPDF utilities. The .zip file contains various command-line utilities, one of which is pdftotext(.exe). It can extract all text content from a well-behaved PDF file. Type pdftotext -help to learn about some of its command-line parameters.
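As a rough illustration of calling pdftotext from a program, here is a minimal Python sketch (the same idea applies to launching the process from C#). It assumes pdftotext is on the PATH; the function names and the file name input.pdf are made up for this example. The -f, -l and -enc options and the "-" output argument (write to stdout) are standard pdftotext options, but check pdftotext -help for your build.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None,
                            exe: str = "pdftotext") -> List[str]:
    """Build the pdftotext argument list (hypothetical helper, not part of XPDF)."""
    cmd = [exe, "-enc", "UTF-8"]          # ask for UTF-8 output explicitly
    if first_page is not None:
        cmd += ["-f", str(first_page)]    # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]     # last page to convert
    cmd += [pdf_path, "-"]                # "-" sends the text to stdout
    return cmd


def extract_pdf_text(pdf_path: str,
                     first_page: Optional[int] = None,
                     last_page: Optional[int] = None) -> str:
    """Run pdftotext and return the extracted text as a string."""
    cmd = build_pdftotext_command(pdf_path, first_page, last_page)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumes a file named input.pdf exists in the current directory.
    print(extract_pdf_text("input.pdf", first_page=3, last_page=7))
```

Keeping the argument list in its own function makes it easy to check the command without having the tool installed.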

Ghostscript:
Install the latest version of Ghostscript (v8.71 at the time of writing). Ghostscript is a PostScript and PDF interpreter. You can use it to extract text from a PDF as well:

gswin32c.exe ^
 -q ^
 -sFONTPATH=c:/windows/fonts ^
 -dNODISPLAY ^
 -dSAFER ^
 -dDELAYBIND ^
 -dWRITESYSTEMDICT ^
 -dSIMPLE ^
 -f ps2ascii.ps ^
 -dFirstPage=3 ^
 -dLastPage=7 ^
 input.pdf ^
 -dQUIET

This will output text contained on pages 3-7 of input.pdf to stdout. You can redirect this to a file by appending > /path/to/output.txt to the command. (Check to make sure that the PostScript utility program ps2ascii.ps is present in your Ghostscript's lib subdirectory.)

If you omit the -dSIMPLE parameter, the output tries to guess line breaks and word spacing. For details, look at the comments inside the ps2ascii.ps file itself. You can even replace that parameter with -dCOMPLEX to get additional text formatting information.
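If you want to drive the Ghostscript command from a program, a minimal Python sketch could look like the following (the same approach works when starting the process from C#). It simply rebuilds the command line shown above; the function names, the mode argument and the default paths are assumptions made for this example, not part of Ghostscript.

```python
import subprocess
from typing import List


def build_ps2ascii_command(pdf_path: str,
                           first_page: int,
                           last_page: int,
                           mode: str = "simple",
                           exe: str = "gswin32c.exe",
                           font_path: str = "c:/windows/fonts") -> List[str]:
    """Rebuild the ps2ascii.ps command line from the answer (hypothetical helper)."""
    cmd = [exe, "-q", f"-sFONTPATH={font_path}",
           "-dNODISPLAY", "-dSAFER", "-dDELAYBIND", "-dWRITESYSTEMDICT"]
    if mode == "simple":
        cmd.append("-dSIMPLE")       # plain text output
    elif mode == "complex":
        cmd.append("-dCOMPLEX")      # additional text formatting info
    elif mode != "default":          # "default": neither switch is passed
        raise ValueError(f"unknown mode: {mode!r}")
    cmd += ["-f", "ps2ascii.ps",
            f"-dFirstPage={first_page}", f"-dLastPage={last_page}",
            pdf_path, "-dQUIET"]
    return cmd


def extract_pages_to_file(pdf_path: str, output_path: str,
                          first_page: int, last_page: int,
                          mode: str = "simple") -> None:
    """Run Ghostscript and redirect its stdout to output_path."""
    cmd = build_ps2ascii_command(pdf_path, first_page, last_page, mode)
    with open(output_path, "wb") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Assumes gswin32c.exe is on the PATH and input.pdf exists.
    extract_pages_to_file("input.pdf", "output.txt", 3, 7)
```

Passing mode="default" leaves out both switches, which corresponds to omitting -dSIMPLE as described above.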

answered Sep 27 '22 15:09

Kurt Pfeifle


Tags:

c#

ms-word

pdf

Alon Gubkin
