I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files already have a text layer and which are image-only?
It would take forever to run every single file through an OCR processor.
Open a PDF file containing a scanned image in Acrobat for Mac or PC. Click on the “Edit PDF” tool in the right pane. Acrobat automatically applies optical character recognition (OCR) to your document and converts it to a fully editable copy of your PDF. Click the text element you wish to edit and start typing.
After opening the PDF, try searching for a word known to be in the document (preferably a word that appears on several different pages) by pressing Ctrl+F and entering the word in the Find box. If the search finds no matches, the document is not text-searchable.
Searchable PDFs usually result from applying OCR (Optical Character Recognition) to scanned PDFs or other image-based documents. During text recognition, the characters and the document structure are analyzed and “read”. A text layer is then added to the image layer, usually placed underneath it.
Open the file in Acrobat and zoom in to 400% or more to examine it. If the text and curved lines remain smooth, it is a native PDF; if they turn blocky or pixelated, the page is a scanned image.
I would write a small script to extract the text from the PDF files and check whether it is "empty". If there is text, the PDF has already been OCR'd. You could use either Ghostscript or Xpdf to extract the text.
EDIT: This should get you started:
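# Print the extracted text and its length for every PDF in the current directory.
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # "-" as the output file makes pdftotext write to stdout.
    # The call operator (&) avoids quoting problems with Invoke-Expression.
    $pdftext = (& "\path\to\xpdf\pdftotext.exe" $pdffile.FullName "-") -join "`n"
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}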
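Unfortunately, even when a PDF contains only images, pdftotext will still output something (for example page-break characters and other whitespace), so a non-zero length alone does not prove there is a text layer. You will have to do some more work to decide whether a given PDF still needs OCR.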
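One way to do that extra work is to ignore whitespace and only count the characters that are left. Here is a rough sketch of the idea in Python rather than PowerShell. It assumes `pdftotext` is on your PATH, and the names `count_visible_chars`, `looks_searchable` and the `MIN_CHARS` cutoff of 20 are made up for illustration, not part of Xpdf. Unlike the loop above, it also walks subfolders.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: pdftotext is on PATH; use a full path if it is not.
PDFTOTEXT = "pdftotext"
# Assumption: arbitrary cutoff; tune it against a few files you know.
MIN_CHARS = 20


def count_visible_chars(text: str) -> int:
    """Count characters that are not whitespace (spaces, newlines, form feeds)."""
    return sum(1 for ch in text if not ch.isspace())


def looks_searchable(text: str, min_chars: int = MIN_CHARS) -> bool:
    """Guess that a PDF has a text layer if enough visible characters came out."""
    return count_visible_chars(text) >= min_chars


def extract_text(pdf: Path) -> str:
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run([PDFTOTEXT, str(pdf), "-"],
                            capture_output=True, check=False)
    return result.stdout.decode("utf-8", errors="replace")


def main(root: str) -> None:
    # Walk the folder tree and print one tab-separated line per PDF.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        label = "text" if looks_searchable(extract_text(pdf)) else "image-only?"
        print(f"{label}\t{pdf}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Run it with the top-level folder as the only argument and redirect the output to a file to get a list you can sort. The "image-only?" label is deliberately tentative: a file that pdftotext cannot read at all (encrypted or damaged, say) also produces no text, so spot-check a few of those before sending the whole batch to OCR.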