How to know if a PDF contains only images or has been OCR scanned for searching?

Tags: search, pdf, ocr, acrobat

I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?

I'm planning on doing this in C# or VB.NET but I don't think being able to tell the two kinds of files apart is language dependent.

273

asked Sep 28 '09 22:09

Bratch

1 Answer

Scanned images that were converted to PDF and OCR'ed afterwards to make the text searchable normally contain the text rendered as "invisible". So what you see on screen (or on paper when printed) is still the original image. But when a search succeeds, the highlighted hits are on the invisible text.

I'd recommend looking at the XPDF-derived command-line tools pdffonts(.exe), pdfinfo(.exe) and pdftotext(.exe). See here for downloads: http://www.foolabs.com/xpdf/download.html

Example usage of pdffonts:

C:\downloads\> pdffonts cisco-ip-phone-7911-guide6.1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
LGOKFL+Univers-BlackOblique          Type 1C           yes yes no   13171  0
LGOKGM+Univers-Black                 Type 1C           yes yes no   13172  0
[....]

This PDF uses fonts (indicated by the 'name' column), has them embedded (indicated by the 'yes' in the 'emb' column) and uses subset fonts (indicated by the 'yes' in the 'sub' column).

C:\downloads\> pdffonts example1.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Univers-BlackOblique                 Type 1C           yes no  no   14    0
Arial                                TrueType          no  no  no   15    0

This PDF uses 2 fonts (indicated by the 'name' column). The font 'Univers-BlackOblique' is embedded completely (indicated by the 'yes' in the 'emb' column and the 'no' in the 'sub' column). The font 'Arial' is also used, but is not embedded.

C:\downloads\> pdffonts example2.pdf
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------

This PDF does not use a single font, and hence does not contain any text (so no OCR text layer either).
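If you want to automate the font check, something along these lines might work. It is only a sketch in Python (the same idea carries over to C# or VB.NET by launching the process and reading its standard output). The function names `count_fonts` and `pdf_has_fonts` are made up for this example, and it assumes `pdffonts` is on the PATH and prints the table layout shown above. Treat the result as a heuristic: a scan with, say, a stamped page number would report a font without having been OCR'ed.

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count the font rows that follow the dashed separator line."""
    lines = pdffonts_output.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # The separator row consists only of dashes and spaces.
        if stripped and set(stripped) <= {"-", " "}:
            return sum(1 for row in lines[i + 1:] if row.strip())
    return 0  # no table found at all


def pdf_has_fonts(pdf_path: str, pdffonts_exe: str = "pdffonts") -> bool:
    """Run pdffonts on one file and report whether it lists any font."""
    result = subprocess.run(
        [pdffonts_exe, pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdffonts exits with an error
    )
    return count_fonts(result.stdout) > 0
```

Files for which `pdf_has_fonts(path)` returns `False` are the candidates to send through OCR.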

Example usage of pdftotext:

C:\downloads\> pdftotext ^
                   -layout ^
                   cisco-ip-phone-7911-guide6.1.pdf ^
                   cisco-ip-phone-7911-guide6.1.txt

This will extract all text strings from the PDF (trying to preserve some semblance of the original layout). If no text comes out, you know the PDF has not been OCR'ed.
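The same check can be scripted. Below is a rough Python sketch, not a definitive implementation: `has_extractable_text` and `pdf_has_text` are names invented for this example, and it assumes `pdftotext` is on the PATH. Passing `-` as the output file makes pdftotext write to standard output, and `-enc UTF-8` asks for UTF-8 output. The `min_chars` threshold is an assumption you would tune for your own documents.

```python
import subprocess


def has_extractable_text(extracted: str, min_chars: int = 1) -> bool:
    """True if the extracted text has at least min_chars non-whitespace chars.

    Form feeds, which pdftotext emits between pages, count as whitespace.
    """
    visible = sum(1 for ch in extracted if not ch.isspace())
    return visible >= min_chars


def pdf_has_text(pdf_path: str, pdftotext_exe: str = "pdftotext") -> bool:
    """Run pdftotext on one file and report whether any text came out."""
    result = subprocess.run(
        [pdftotext_exe, "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,  # raise if pdftotext exits with an error
    )
    # Decode leniently so one odd byte does not abort a batch run.
    text = result.stdout.decode("utf-8", errors="replace")
    return has_extractable_text(text)
```

A higher `min_chars` (for example 20) may help to ignore files that contain only a stray stamp or page number next to the scanned image.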

133

answered Oct 07 '22 18:10

Kurt Pfeifle
