What is the best way to programmatically check whether a PDF file is entirely scanned? I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and use the result to decide whether the file has been OCRed, but this approach is not 100% accurate. I'd like to know whether there is another way to handle the problem.
As you might guess, the solution must be Java-based.
This should be easily achievable with the Read PDF Text activity. Read the PDF, then use an If statement to check whether the length of the PDF text returned by that activity is greater than 0. If it is, the text in the PDF was readable and the PDF was not an image-only scan.
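Since the question asks for Java, the same length check can be sketched per page rather than per document, which helps separate a completely scanned file from one that merely contains a few scanned pages. This is a minimal sketch, not a definitive implementation: the class and enum names are made up for illustration, and it assumes the per-page text has already been extracted, for example with PDFBox's PDFTextStripper restricted to one page at a time via setStartPage and setEndPage. Note that a scanned PDF that has already been OCRed carries a text layer and would be reported as having text.

```java
import java.util.List;

// Hypothetical helper: classifies a document from its per-page extracted text.
class PageTextCheck {
    enum Kind { ALL_PAGES_HAVE_TEXT, SOME_PAGES_HAVE_TEXT, NO_TEXT }

    // pageTexts holds one entry per page; how it is produced is up to the caller
    // (assumption: something like PDFBox's PDFTextStripper, one page at a time).
    static Kind classify(List<String> pageTexts) {
        int emptyPages = 0;
        for (String text : pageTexts) {
            // Whitespace-only output counts as "no text" for that page.
            if (text == null || text.trim().isEmpty()) {
                emptyPages++;
            }
        }
        if (emptyPages == pageTexts.size()) {
            return Kind.NO_TEXT; // also covers a document with zero pages
        }
        return emptyPages == 0 ? Kind.ALL_PAGES_HAVE_TEXT : Kind.SOME_PAGES_HAVE_TEXT;
    }
}
```

A result of NO_TEXT suggests a raw image-only scan, while SOME_PAGES_HAVE_TEXT points to a mixed document where only certain pages may need OCR. Blank pages also produce no text, so the result is a hint rather than proof.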
Depending on how the PDF was created, text could be grouped by word, sentence, paragraph or page. This process does a good job of helping you identify documents that have zero searchable words, especially raw, image-only scanned PDFs. If the report shows a checkmark, then you will want to OCR the document.
find . -name "*.pdf" -exec sh -c 'for f; do if [ $(pdffonts "$f" 2> /dev/null | wc -l) -lt 3 ]; then echo "$f"; fi; done' sh {} +
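Explanation: pdffonts file.pdf prints two header lines followed by one line per font, so it shows more than 2 lines if the PDF contains text. The command prints the names of all PDF files that use no fonts and therefore contain no text, which are most likely scanned PDFs.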
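Because the question asks for a Java-based solution, the same line-count test can be sketched by calling pdffonts from Java. This is a rough sketch under stated assumptions: the pdffonts binary (from Poppler or Xpdf) is installed and on the PATH, the class and method names are hypothetical, and ProcessBuilder.Redirect.DISCARD requires Java 9 or later.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the external pdffonts command.
class PdfFontsCheck {
    // pdffonts prints two header lines, then one line per font.
    static final int HEADER_LINES = 2;

    // Mirrors the shell test "-lt 3": no font lines means no text was found.
    static boolean looksScanned(List<String> pdffontsOutput) {
        return pdffontsOutput.size() <= HEADER_LINES;
    }

    // Runs pdffonts on one file and returns its standard output line by line.
    static List<String> runPdfFonts(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdfPath)
                .redirectError(ProcessBuilder.Redirect.DISCARD) // like 2> /dev/null
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }
}
```

A call such as PdfFontsCheck.looksScanned(PdfFontsCheck.runPdfFonts("file.pdf")) then plays the role of the if test in the shell command. As in the shell version, a file that pdffonts cannot read produces no output and is reported as scanned, so checking the exit code from waitFor() separately may be worthwhile.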