Check if a PDF file is a scanned one

What is the best way to programmatically check whether a PDF file consists entirely of scanned pages? I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and use the result to decide whether the file has been OCRed, but this approach is not 100% accurate. Is there another way to approach the problem?

The solution must be Java-based.

512

asked Mar 08 '10 18:03

Alex

1 Answer

find . -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'if [ $(pdffonts "$1" 2> /dev/null | wc -l) -lt 3 ]; then echo "$1"; fi' _ {}

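Explanation: pdffonts file.pdf prints a two-line header followed by one line per font, so it shows more than 2 lines if the PDF contains text. The command outputs the filenames of all PDF files that list no fonts, which means they contain no text and are most likely scanned PDFs. A scan that has already been OCRed carries a hidden text layer with fonts, so it will not be listed.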
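Since the question asks for something Java-based, here is a rough sketch of the same heuristic driven from Java. It is only one possible way to do it and rests on a few assumptions: pdffonts (from poppler-utils or xpdf) is installed and on the PATH, Java 11 or later is available, and the class and method names (ScannedPdfCheck, listsNoFonts, looksScanned) are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.stream.Collectors;

public class ScannedPdfCheck {

    // pdffonts prints two header lines (column names, then dashes) before any font rows.
    private static final int HEADER_LINES = 2;

    /** Returns true when the given pdffonts output contains no font rows. */
    public static boolean listsNoFonts(String pdffontsOutput) {
        long nonBlankLines = pdffontsOutput.lines()
                .filter(line -> !line.isBlank())
                .count();
        return nonBlankLines <= HEADER_LINES;
    }

    /**
     * Runs pdffonts on one file and applies the heuristic.
     * Assumption: the pdffonts executable is on the PATH.
     * As in the shell one-liner, a run that prints nothing
     * (for example because the file could not be opened) is also reported.
     */
    public static boolean looksScanned(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("pdffonts", pdf.toString())
                .redirectError(ProcessBuilder.Redirect.DISCARD) // same as 2> /dev/null
                .start();
        String output;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            output = reader.lines().collect(Collectors.joining("\n"));
        }
        process.waitFor();
        return listsNoFonts(output);
    }

    public static void main(String[] args) throws Exception {
        // Print every file given on the command line that lists no fonts.
        for (String arg : args) {
            if (looksScanned(Path.of(arg))) {
                System.out.println(arg);
            }
        }
    }
}
```

Usage would be something like java ScannedPdfCheck.java a.pdf b.pdf, which prints the files that list no fonts. Keeping the line counting in listsNoFonts separate from the process call makes the heuristic easy to check against saved pdffonts output without having the tool installed.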

105

answered Oct 21 '22 06:10

Orsiris de Jong

Tags: java, pdf, ocr
