
Check if a PDF file is a scanned one

Tags: java, pdf, ocr

What is the best way to programmatically check whether a PDF file consists entirely of scanned pages? I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and use the result to decide whether the file has been OCRed, but this approach is not 100% accurate. Is there another way to handle the problem?

The solution must be Java-based.

Alex asked Mar 08 '10 18:03




1 Answer

find . -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ $(pdffonts "$1" 2> /dev/null | wc -l) -lt 3 ]; then echo "$1"; fi' _ {}

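Explanation: pdffonts file.pdf prints a two-line header followed by one line per font, so the output has more than 2 lines only if the PDF uses fonts, that is, contains text. The command therefore prints the names of all PDF files that contain no text, which are most likely image-only scans. Note that a scanned PDF that already carries an OCR text layer typically lists fonts and will not be reported, and a file that pdffonts cannot read produces no output and will be reported.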
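Since the question asks for a Java-based solution, the same line-count check can be sketched in Java by calling pdffonts as an external process. This is only a minimal sketch, not a tested implementation: it assumes the pdffonts binary is installed and on the PATH, and the class name ScannedPdfCheck and the method names looksScanned and runPdffonts are made up for illustration. It needs Java 9 or later for ProcessBuilder.Redirect.DISCARD.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ScannedPdfCheck {

    // Hypothetical helper: applies the "fewer than 3 lines" rule from the
    // shell one-liner. pdffonts prints a two-line header and then one line
    // per font, so fewer than 3 lines means no fonts were found.
    // As in the shell version, empty output (pdffonts failed) also counts.
    public static boolean looksScanned(List<String> pdffontsOutput) {
        return pdffontsOutput.size() < 3;
    }

    // Hypothetical helper: runs `pdffonts <pdfPath>` and returns its stdout
    // lines. Assumes pdffonts is on the PATH; stderr is discarded.
    public static List<String> runPdffonts(String pdfPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdfPath)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        process.waitFor();
        return lines;
    }

    // Prints every path given on the command line that looks scanned.
    public static void main(String[] args) throws Exception {
        for (String path : args) {
            if (looksScanned(runPdffonts(path))) {
                System.out.println(path);
            }
        }
    }
}
```

After compiling, running java ScannedPdfCheck a.pdf b.pdf would print the files for which pdffonts lists no fonts. Keeping the line-count rule in its own method makes it easy to swap in a stricter test later, for example one based on text extracted with PDFBox or iText.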

Orsiris de Jong answered Oct 21 '22 06:10