Extract text from pdf and word files

Tags: c#, ms-word, pdf
How can I extract text from pdf or word files (remove bold, images, and other rich text formatting media) in C#?

Alon Gubkin asked Sep 06 '10 16:09


2 Answers

You can use the IFilter filters that the Windows indexing service uses. They are designed to extract plain text from documents so that their contents can be searched. This works for Office files, PDFs, HTML and so on: basically any file type that has a filter. The only downside is that the filters must be installed on the server, so this may not be possible if you don't have direct access to it. Some filters come pre-installed with Windows, but others, such as the PDF filter, you have to install yourself. For a C# implementation, check out this article: Using IFilter in C#

pbz answered Sep 27 '22 17:09


PDF:

You have various options.

pdftotext:
Download the XPDF utilities. The .zip file contains several command-line utilities, one of which is pdftotext(.exe). It can extract all text content from a well-behaved PDF file. Run pdftotext -help to learn about some of its command-line parameters.
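Since the question asks about C#, here is a minimal sketch of how pdftotext could be called from C# through System.Diagnostics.Process. The class and method names (PdfToText, BuildArguments, Extract) are made up for illustration, and the path to pdftotext.exe is something you have to supply yourself. It relies on pdftotext writing to stdout when the output file is given as "-", and on the -f and -l switches for the first and last page.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of XPDF: wraps a call to pdftotext.exe.
public static class PdfToText
{
    // Builds the argument string: optional page range, the quoted input
    // file, then "-" so that pdftotext writes the text to stdout.
    public static string BuildArguments(string pdfPath, int? firstPage = null, int? lastPage = null)
    {
        string args = "";
        if (firstPage.HasValue) args += "-f " + firstPage.Value + " ";
        if (lastPage.HasValue) args += "-l " + lastPage.Value + " ";
        return args + "\"" + pdfPath + "\" -";
    }

    // Runs pdftotext and returns everything it printed to stdout.
    // exePath is an assumption: wherever you unpacked pdftotext.exe.
    public static string Extract(string exePath, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before waiting so a full output buffer cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Usage would look something like string text = PdfToText.Extract(@"C:\tools\xpdf\pdftotext.exe", "input.pdf"); where the path is only an example.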

Ghostscript:
Install the latest version of Ghostscript (v8.71 at the time of writing). Ghostscript is a PostScript and PDF interpreter. You can also use it to extract text from a PDF:

gswin32c.exe ^
 -q ^
 -sFONTPATH=c:/windows/fonts ^
 -dNODISPLAY ^
 -dSAFER ^
 -dDELAYBIND ^
 -dWRITESYSTEMDICT ^
 -dSIMPLE ^
 -f ps2ascii.ps ^
 -dFirstPage=3 ^
 -dLastPage=7 ^
 input.pdf ^
 -dQUIET ^
 -c quit

This writes the text contained on pages 3-7 of input.pdf to stdout. You can redirect it to a file by appending > /path/to/output.txt to the command. (Make sure that the PostScript utility program ps2ascii.ps is present in the lib subdirectory of your Ghostscript installation.)

If you omit the -dSIMPLE parameter, the text output will try to guess line breaks and word spacing. For details, look at the comments inside the ps2ascii.ps file itself. You can even replace that parameter with -dCOMPLEX to get additional text formatting information.
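To drive the same Ghostscript command from C#, a rough sketch could look like the following. GhostscriptText, BuildArguments and Extract are hypothetical names, the location of gswin32c.exe is up to you, and the switches are simply the ones from the command above. The trailing -c quit is included so that Ghostscript exits after the last file instead of waiting for input, which would otherwise leave the child process hanging.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not part of Ghostscript: wraps the ps2ascii.ps call.
public static class GhostscriptText
{
    // Mirrors the switches of the command line shown above.
    // Pass simple = false to leave out -dSIMPLE.
    public static string BuildArguments(string pdfPath, int firstPage, int lastPage, bool simple = true)
    {
        var args = new List<string>
        {
            "-q",
            "-sFONTPATH=c:/windows/fonts",
            "-dNODISPLAY",
            "-dSAFER",
            "-dDELAYBIND",
            "-dWRITESYSTEMDICT"
        };
        if (simple) args.Add("-dSIMPLE");
        args.Add("-f");
        args.Add("ps2ascii.ps");
        args.Add("-dFirstPage=" + firstPage);
        args.Add("-dLastPage=" + lastPage);
        args.Add("\"" + pdfPath + "\"");
        args.Add("-dQUIET");
        // Make Ghostscript quit after processing instead of reading stdin.
        args.Add("-c");
        args.Add("quit");
        return string.Join(" ", args);
    }

    // Runs Ghostscript and returns the text it printed to stdout.
    // gsPath is an assumption: the full path to gswin32c.exe on your machine.
    public static string Extract(string gsPath, string pdfPath, int firstPage, int lastPage)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = gsPath,
            Arguments = BuildArguments(pdfPath, firstPage, lastPage),
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

Capturing stdout in the process takes the place of the > /path/to/output.txt redirection; write the returned string to a file yourself if you need one.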

Kurt Pfeifle answered Sep 27 '22 15:09