How to identify PDF files that need OCR?

1 Answers

I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text the PDF already was OCRed. You could either use ghostscript or XPDF to extract the text.

EDIT: This should get you started:

foreach ($pdffile in get-childitem -filter *.pdf){
    $pdftext=invoke-expression ("\path\to\xpdf\pdftotext.exe '"+$pdffile.fullname+"' -");
    write-host $pdffile.fullname
    write-host $pdftext.length;
    write-host $pdftext;
    write-host "-------------------------------";
}

Unfortunately even when you have only images in your PDF pdftotext will extract some text, so you will have to do some more work to check whether you need to OCR the pdf.

answered Sep 21 '22 23:09

Ocaso Protal

Related questions
                            
                                how to read .doc, .docx, .xls files in android [duplicate]
                            
                                Annotating Adobe Reader PDFs with math symbols
                            
                                Display slide number in reveal.js printout
                            
                                How to generate PDF documents from rpt in a multi-threaded approach?
                            
                                IE and javascript: efficient way to decode (and render) b64-encoded PDF blob
                            
                                stamp a pdf with an image
                            
                                How to extract text from a Specific Area in a PDF using Python?
                            
                                How to replace/delete text from a pdf using python?
                            
                                Compile & Fill Jasper Report - XML datasource
                            
                                Could not render the url. Could not get image from url.Navigation timeout
                            
                                Android - how to convert html to pdf? [duplicate]
                            
                                convert pdf with 300dpi bitmaps to svg
                            
                                Calling Inkscape in Python [duplicate]
                            
                                using AJAX and Jquery to create pdf using FPDF
                            
                                Open & Edit PDF files in Android App with APIs [closed]
                            
                                Create pdf with tooltips in python
                            
                                Android PdfDocument.Page - Problems with image size
                            
                                Chrome Save as PDF changing CJK characters
                            
                                Is there any framework to highlight text on pdf file after rendering on the iphone
                            
                                Centering a .jpg image in PDF using Imagemagick?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to identify PDF files that need OCR?

Tags:

pdf

ocr

Fuji - H2O

People also ask

1 Answers

Ocaso Protal

Recent Activity

Donate For Us