
How to identify PDF files that need OCR?

Tags: pdf, ocr

I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files are already OCR'd and which PDFs are image-only?

It would take forever to run every single file through an OCR processor.

Fuji - H2O asked Oct 12 '11 13:10



1 Answer

I would write a small script that extracts the text from each PDF file and checks whether it is empty. If there is text, the PDF has already been OCR'd. You could use either Ghostscript or Xpdf to extract the text.

EDIT: This should get you started:

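# Print how much text pdftotext extracts from every PDF in the current directory.
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # '-' sends the text to stdout; Out-String joins the output lines into one string,
    # so .Length below is a character count rather than a line count.
    $pdftext = & "\path\to\xpdf\pdftotext.exe" $pdffile.FullName '-' | Out-String
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}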

Unfortunately, even when a PDF contains only images, pdftotext will still extract some text (for example, whitespace or page-break characters), so an "is the output empty?" test is not enough. You will have to do some more work to decide whether a PDF needs OCR.
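One possible way to do that extra work is sketched below in Python (not the PowerShell used above, only because it is easy to test). It is a rough heuristic, not a definitive check. It assumes pdftotext is on the PATH and relies on its default behaviour of ending each page with a form feed. The threshold MIN_CHARS_PER_PAGE and all function names are made up for illustration; you would want to tune the threshold on a sample of your own files.

```python
import subprocess
from pathlib import Path

# Hypothetical threshold: pages with fewer letters/digits count as "no text".
MIN_CHARS_PER_PAGE = 20


def count_text_chars(page_text: str) -> int:
    """Count letters and digits, ignoring whitespace and punctuation."""
    return sum(ch.isalnum() for ch in page_text)


def needs_ocr(extracted: str, min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True if any page has fewer than min_chars letters/digits."""
    # pdftotext ends each page with a form feed, so splitting on it yields pages.
    pages = extracted.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty piece after the final form feed
    if not pages:
        return True  # nothing extracted at all
    return any(count_text_chars(page) < min_chars for page in pages)


def extract_text(pdf: Path, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext and return what it writes to stdout (the "-" argument)."""
    result = subprocess.run(
        [pdftotext, str(pdf), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def find_pdfs_needing_ocr(directory: str) -> list[Path]:
    """List the PDFs directly inside directory that look image-only."""
    return [
        pdf
        for pdf in sorted(Path(directory).glob("*.pdf"))
        if needs_ocr(extract_text(pdf))
    ]
```

Calling find_pdfs_needing_ocr(".") would return the files in the current directory that look like they still need OCR. Flagging a file as soon as one page lacks text is deliberately conservative: it catches partially OCR'd files, but it will also flag documents that contain a genuinely blank page, so a few false positives are to be expected.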

Ocaso Protal answered Sep 21 '22 23:09