I would like to extract text from a portion of a PDF page (specified by coordinates) using Ghostscript. Can anyone help me out?


PDF text extraction from given coordinates

2 Answers

Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, Ghostscript cannot restrict the extraction to "portions" (parts of single pages). What it can do: extract the text of a certain range of pages only.

First: Ghostscript's `txtwrite` output device (not so good)

    gs \
      -dBATCH \
      -dNOPAUSE \
      -sDEVICE=txtwrite \
      -dFirstPage=3 \
      -dLastPage=5 \
      -sOutputFile=- \
      /path/to/your/pdf

This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use

   -sOutputFile=textfilename.txt

gs Update:

Recent versions of Ghostscript have seen major improvements and bug fixes in the txtwrite device. See the recent Ghostscript changelogs (search them for txtwrite) for details.

Second: Ghostscript's `ps2ascii.ps` PostScript utility (better)

This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:

    gs \
      -q \
      -dNODISPLAY \
      -P- \
      -dSAFER \
      -dDELAYBIND \
      -dWRITESYSTEMDICT \
      -dSIMPLE \
      /path/to/ps2ascii.ps \
      input.ps \
      -c quit

If the -dSIMPLE parameter is not defined, each output line contains some additional information beyond the pure text content, namely the fonts and font sizes used.

If you replace that parameter with -dCOMPLEX, you'll get additional information about the colors and images used.

Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but it worked for me in most cases where I needed it.
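The PDF-to-PostScript conversion that this method needs can itself be done with Ghostscript. A minimal sketch, assuming a Ghostscript build that includes the `ps2write` device; the function name `pdf_to_ps` and the file names are placeholders for illustration, not part of Ghostscript:

```shell
# Sketch: convert a PDF to PostScript so it can be fed to ps2ascii.ps.
# Assumes `gs` is on PATH and was built with the ps2write device.
# pdf_to_ps is a hypothetical helper name, not a Ghostscript command.
pdf_to_ps() {
    # $1 = input PDF, $2 = output PostScript file
    gs -q -dBATCH -dNOPAUSE -sDEVICE=ps2write -sOutputFile="$2" "$1"
}
```

Calling `pdf_to_ps /path/to/your/pdf input.ps` should then produce the input.ps file that the ps2ascii.ps command expects.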

Third: XPDF's `pdftotext` CLI utility (more comfortable than Ghostscript)

A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:

    pdftotext \
      -f 13 \
      -l 17 \
      -layout \
      -opw supersecret \
      -upw secret \
      -eol unix \
      -nopgbrk \
      /path/to/your/pdf \
      - | less

This will extract pages 13 (first page) through 17 (last page) of the named PDF file, which is protected by both a user and an owner password (secret and supersecret). It preserves the layout, uses the Unix EOL convention, inserts no page breaks between PDF pages, and pipes the result through less.

pdftotext -h displays all available commandline options.

Of course, both tools only work for the text parts of PDFs (if they have any). Oh, and mathematical formulas won't work too well either... ;-)

pdftotext Update:

Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:

-x <int> : top left corner's x-coordinate of crop area
-y <int> : top left corner's y-coordinate of crop area
-W <int> : crop area's width in pixels (defaults to 0)
-H <int> : crop area's height in pixels (defaults to 0)

These work best when used together with the -layout parameter.
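Put together, a region extraction could look like the sketch below. It assumes Poppler's pdftotext, and that the crop values are pixels at the resolution given by -r (72 DPI unless changed, so that one pixel equals one PDF point), measured from the top-left corner of the page; check `pdftotext -h` on your version to confirm. The helper names `pt_to_px` and `extract_region` are made up for illustration.

```shell
# Convert a length in PDF points (1/72 inch) to pixels at a given DPI,
# rounded to the nearest integer, because -x/-y/-W/-H take integers.
pt_to_px() {
    # $1 = length in points, $2 = resolution in DPI
    awk -v pt="$1" -v dpi="$2" 'BEGIN { printf "%d\n", pt * dpi / 72 + 0.5 }'
}

# Extract the text inside one rectangle of one page and print it to stdout.
# Coordinates are pixels at 72 DPI, i.e. PDF points (an assumption, see above).
extract_region() {
    # $1 = page, $2 = x, $3 = y, $4 = width, $5 = height, $6 = PDF file
    pdftotext -f "$1" -l "$1" -r 72 \
        -x "$2" -y "$3" -W "$4" -H "$5" \
        -layout "$6" -
}
```

For example, `extract_region 2 50 100 300 200 /path/to/your/pdf` would ask for the text inside a 300 by 200 point box on page 2 whose top-left corner lies 50 points from the left edge and 100 points from the top edge. If you work at another resolution, `pt_to_px` converts point values to the pixel values that resolution needs.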

Fourth: MuPDF's `mutool draw` command can also extract text

The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) has bundled a command line tool, mutool. To extract text from a PDF with this tool, use:

mutool draw -F txt the.pdf

This will emit the extracted text to stdout. Use -o filename.txt to write it into a file.
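mutool draw also accepts a page list after the file name, so a page-range extraction like the ones shown for the other tools can be sketched as follows. The wrapper name `mutool_text` is hypothetical; only `mutool draw`, `-F` and `-o` come from MuPDF.

```shell
# Sketch: write the text of a page range to a file with mutool.
# mutool_text is a hypothetical wrapper name, not a MuPDF command.
mutool_text() {
    # $1 = PDF file, $2 = page list such as "3-5", $3 = output text file
    mutool draw -F txt -o "$3" "$1" "$2"
}
```

For example, `mutool_text the.pdf 3-5 pages-3-5.txt` would write the text of pages 3 to 5 into pages-3-5.txt.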

Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)

TET, the Text Extraction Toolkit from the pdflib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command-line interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:

Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.

In my experience, while it does not sport the most straightforward CLI you can imagine, once you get used to it, it will do what it promises to do for most PDFs you throw at it.

And there are even more options:

podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf

198

answered Sep 25 '22 02:09

Kurt Pfeifle

I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a subimage cropped to the given coordinates or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter to narrow the region for OCR.

Look at VietOCR for a working example; it uses Tesseract as its OCR engine and Ghostscript as its PDF-to-image converter.
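The render, crop, and OCR route described above could be sketched on the command line roughly like this. This is not how VietOCR does it internally; it is only an illustration that assumes gs, tesseract, and ImageMagick's `convert` (an extra tool not mentioned in this answer) are installed. The function name `ocr_region` is hypothetical.

```shell
# Sketch: OCR one rectangle of one PDF page.
# Assumes gs, ImageMagick's `convert` and tesseract are on PATH.
# x, y, width, height are pixels at the render resolution, measured from the
# top-left corner of the page; to convert from PDF points: px = pt * dpi / 72.
ocr_region() {
    # $1 = PDF, $2 = page, $3 = x, $4 = y, $5 = width, $6 = height,
    # $7 = render resolution in DPI (optional, defaults to 300)
    ocr_dpi=${7:-300}
    ocr_tmp=$(mktemp -d) || return 1

    # 1. Render only the requested page to a PNG.
    gs -q -dBATCH -dNOPAUSE -sDEVICE=png16m -r"$ocr_dpi" \
        -dFirstPage="$2" -dLastPage="$2" \
        -sOutputFile="$ocr_tmp/page.png" "$1" &&
    # 2. Crop the rectangle (geometry is WIDTHxHEIGHT+X+Y).
    convert "$ocr_tmp/page.png" -crop "${5}x${6}+${3}+${4}" +repage \
        "$ocr_tmp/region.png" &&
    # 3. Run OCR on the cropped image and print the text to stdout.
    tesseract "$ocr_tmp/region.png" stdout
    ocr_status=$?

    # Clean up the temporary images and report the pipeline's status.
    rm -rf "$ocr_tmp"
    return "$ocr_status"
}
```

For example, `ocr_region /path/to/your/pdf 1 200 400 1250 830` would OCR a 1250 by 830 pixel box on page 1, rendered at 300 DPI. OCR is only worth the extra cost when the page has no usable text layer; for PDFs with real text, the direct extraction tools are faster and more accurate.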

answered Sep 25 '22 02:09

nguyenq
