Is it possible to access the text overlay from a searchable PDF?

Tags: pdf, ocr

I understand there is a difference between a PDF and a text-searchable PDF. Text-searchable PDFs have a text overlay that is used for searching. Is it possible to extract this text overlay into a txt file? Perhaps with an Adobe API?

bheussler asked Oct 04 '12 16:10



1 Answer

"Searchable PDF" is not an official definition, but it is a commonly used expression.

If a standard PDF embeds all the fonts it uses, and these fonts don't use a custom encoding, chances are that it is "searchable": you can copy and paste text from it, and you can extract text from it (tools like pdftotext then work more or less flawlessly). This has nothing to do with a "text overlay"; it's the standard architecture of a PDF.

What you describe as a "text overlay" is what can be added to a scanned PDF. PDFs created from scans are full-page images, usually TIFF, that are embedded in (otherwise empty) PDF pages. Then, in an additional step, the "text overlay" is added by running OCR (optical character recognition) against the page images. This is what makes the otherwise dumb, pixels-only PDF "searchable".

If such a PDF with a "text overlay" doesn't use weird constructions around its fonts, then it should be easy to extract this text into a *.txt file. After all, the whole point of running OCR over an image-only PDF is to add "searchable" text:

  • Install pdftotext (available for Linux, Unix, Windows, Mac OS X) and then try running:

    pdftotext -layout some-input.pdf some-input.txt
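
If you'd rather drive this from a script, the command above can be wrapped roughly like this. It's only a sketch: it assumes pdftotext is installed and on your PATH, and the function names build_command and extract_text are made up for illustration. The only flag used is the -layout one from the command above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf_path: str, txt_path: str) -> List[str]:
    """Assemble the same pdftotext command line as shown above."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_text(pdf_path: str) -> Path:
    """Run pdftotext on one PDF and return the path of the .txt it wrote.

    Hypothetical helper: the output file sits next to the input,
    with the extension swapped to .txt.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    txt_path = Path(pdf_path).with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    subprocess.run(build_command(pdf_path, str(txt_path)), check=True)
    return txt_path
```

Calling extract_text("some-input.pdf") should then leave you with some-input.txt in the same directory, provided the PDF actually carries a text layer.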
    

Caveat: most OCR works far from perfectly. If you get a recognition rate of 99% for all characters, you're lucky. (But even this means that roughly 5 to 10% of all words, and the majority of sentences, contain an error -- something that would give you guaranteed failure in high school...)
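
To put rough numbers on that caveat, here is a small back-of-the-envelope calculation. It assumes that every character is misrecognized independently with the same probability, which real OCR engines don't strictly do, and the word and sentence lengths are just illustrative picks.

```python
def error_probability(char_accuracy: float, length: int) -> float:
    """Chance that a run of `length` characters contains at least one error,
    assuming each character is recognized independently."""
    return 1.0 - char_accuracy ** length


# Illustrative lengths only; real words and sentences vary a lot.
for label, length in [
    ("5-character word", 5),
    ("10-character word", 10),
    ("100-character sentence", 100),
]:
    print(f"{label}: {error_probability(0.99, length):.1%}")
```

At 99% character accuracy this comes out at about 4.9% for a 5-character word, 9.6% for a 10-character word and 63.4% for a 100-character sentence, and the figure keeps climbing as sentences get longer.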

It should also be noted that these "text overlays" are technically identical to any other text section in a PDF (except that they contain more spelling and grammatical errors :-) ), but they use a special text rendering mode (mode 3), described in the PDF specification as "Neither fill nor stroke text (invisible)." Though the text is 'invisible', you can still highlight, copy and paste, or extract these text sections.
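
As a rough illustration of that rendering mode: inside a page's content stream the mode is set with the Tr operator, so an invisible OCR layer shows up as "3 Tr". The sketch below assumes you already have the decoded (uncompressed) content stream as a string. The function names and the sample streams are made up for illustration, and a plain regex like this would also match the characters "3 Tr" if they happened to appear inside a text string, so treat it as a heuristic and not as a PDF parser.

```python
import re
from typing import List

# Matches the text rendering mode operator with its integer operand, e.g. "3 Tr".
_TR_OPERATOR = re.compile(r"(?<![\w.])(\d+)\s+Tr\b")


def rendering_modes(content_stream: str) -> List[int]:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(m.group(1)) for m in _TR_OPERATOR.finditer(content_stream)]


def has_invisible_text(content_stream: str) -> bool:
    """True if the stream switches to mode 3 (neither fill nor stroke)."""
    return 3 in rendering_modes(content_stream)


# Hand-written sample streams, not taken from a real file.
ocr_layer = "BT /F1 12 Tf 3 Tr 72 720 Td (Hello OCR) Tj ET"
normal_text = "BT /F1 12 Tf 0 Tr 72 720 Td (Hello) Tj ET"

print(has_invisible_text(ocr_layer))
print(has_invisible_text(normal_text))
```

Real content streams are usually compressed, so you would need a PDF library or tool to decode them before a check like this can be applied.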

Kurt Pfeifle answered Jan 05 '23 21:01
