I have many PDF documents on my system, and some of them are image-based, with no editable text. For those I run OCR in Foxit PhantomPDF, which can OCR multiple files at once, so that they become searchable. I would like to find all of my PDF documents that are image-based.
I do not understand how a PDF reader recognizes that a document is an image rather than text. There must be some fields that these readers inspect, and those fields should be accessible from the terminal too. This answer in the thread Check if a PDF file is a scanned one gives some open-ended proposals for how to do it:
Your best bet might be to check to see if it has text and also see if it contains a large page-sized image or lots of tiled images which cover the page. If you also check the metadata this should cover most options.
I would like to understand how to do this effectively. If there were a dedicated metadata field, it would be easy, but I have not found one. The most promising approach seems to be checking whether a page consists of a single page-sized image, since that is effective and is what some PDF readers already do, but I do not know how to implement it.
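For what it's worth, here is a rough sketch of that page-sized-image check built on poppler-utils (pdfinfo and pdfimages); the parsing of the -list columns, the helper names, and the 0.9 coverage threshold are my own assumptions, not anything a PDF reader is known to do:

# Sketch only: flag a PDF whose page is covered by one big image.
# Assumes poppler-utils is installed (pdfimages needs -list support);
# the 0.9 coverage threshold is an arbitrary guess.
import re
import subprocess

def page_size_pts(pdf):
    """Page size in points, as reported by pdfinfo."""
    out = subprocess.run(["pdfinfo", pdf], capture_output=True, text=True).stdout
    m = re.search(r"Page size:\s+([\d.]+) x ([\d.]+) pts", out)
    return float(m.group(1)), float(m.group(2))

def has_page_sized_image(pdf, coverage=0.9):
    """True if any embedded image covers most of the page area."""
    w_pts, h_pts = page_size_pts(pdf)
    out = subprocess.run(["pdfimages", "-list", pdf],
                         capture_output=True, text=True).stdout
    for line in out.splitlines()[2:]:              # skip the two header lines
        f = line.split()
        w_px, h_px = int(f[3]), int(f[4])          # image size in pixels
        x_ppi, y_ppi = float(f[12]), float(f[13])  # resolution it is drawn at
        # Convert pixels back to points to compare against the page size.
        img_area = (w_px / x_ppi * 72) * (h_px / y_ppi * 72)
        if img_area >= coverage * w_pts * h_pts:
            return True
    return False

print(has_page_sized_image("scan.pdf"))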
In the Hough transform, parameters are chosen in a hypercube of the parameter space. Its complexity is $O(A^{m-2})$, where $m$ is the number of parameters and $A$ is the size of the image space, so with more than three parameters the problem becomes difficult. Foxit Reader most probably uses three parameters in its implementation. Edge detection is easy to do well, which ensures efficiency, and must be done before the Hough transform; corrupted pages are simply ignored. What the other two parameters are is still unknown, but I think they must be nodes and some intersections. How those intersections are computed, and the exact formulation of the problem, are unknown to me.
The command works in Debian 8.5, but I could not initially get it to work in Ubuntu 16.04:
masi@masi:~$ find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'
./Downloads/596P.pdf
./Downloads/20160406115732.pdf
^C
OS: Debian 8.5 64 bit
Linux kernel: 4.6 of backports
Hardware: Asus Zenbook UX303UA
Normally, all image content in a PDF is embedded in the file, but PDF also allows image data to be stored in external files by means of external streams or Alternate Images.
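If you want to check whether a given file actually uses such alternates, a small sketch like the one below might work. It relies on the pikepdf library, which is my assumption and not part of this answer:

# Hypothetical sketch: list image XObjects that declare /Alternates.
# pikepdf is an assumption here, not something this thread prescribes.
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    for n, page in enumerate(pdf.pages, start=1):
        for name, image in page.images.items():
            if "/Alternates" in image:
                print(f"page {n}: image {name} declares alternate versions")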
Image-based PDFs are typically created by scanning paper on a copier, taking photographs, or taking screenshots. To a computer they are just images: though we humans can see text in them, the file consists only of an image layer, not the searchable text layer that true PDFs contain.
Basically, when a scanned or image-based document is opened, a yellow bar appears on the screen telling you whether the current document contains editable text.
Being late to the party, here's a simple solution, assuming that PDF files which already contain fonts aren't image-based:
find ./ -name "*.pdf" -print0 | xargs -0 -I {} \
bash -c 'export file="{}"; \
if [ $(pdffonts "$file" 2> /dev/null | \
wc -l) -lt 3 ]; then echo "$file"; fi'
As a one-liner:
find ./ -name "*.pdf" -print0 | xargs -0 -I {} bash -c 'export file="{}"; if [ $(pdffonts "$file" 2> /dev/null | wc -l) -lt 3 ]; then echo "$file"; fi'
Explanation:
pdffonts file.pdf
will show more than two lines if the PDF contains text.

The full command prints the filenames of all PDF files that don't contain text.
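For reference, on an image-only PDF, pdffonts from poppler-utils typically prints just its two header lines and nothing else, which is why the -lt 3 test works. The output looks roughly like this:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------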
My OCR project, which has the same feature, is on GitHub: deajan/pmOCR.
Purely from the OCR side, we can use the Hough transform to find the biggest square on a page and then calculate the ratio of its area to the whole page area. If the ratio is low, we can assume the page is skewed. Finally, the proportion of skewed pages out of the total page count can indicate whether the PDF is a scanned one.
I know the process is very slow and the right proportion threshold is difficult to determine. ^-^
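A minimal sketch of that idea, assuming Python with the pdf2image, numpy and opencv-python packages (my choice, not the answerer's), and using the bounding box of the detected Hough segments as a crude stand-in for the "biggest square"; every threshold here is a guess:

# Sketch: estimate how much of each page is framed by straight edges.
# Assumes pdf2image (which needs poppler), numpy and opencv-python.
import cv2
import numpy as np
from pdf2image import convert_from_path

def frame_ratio(pil_page):
    """Area of the bounding box of Hough line segments / page area."""
    gray = cv2.cvtColor(np.array(pil_page), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 50, 150)               # edge detection comes first
    segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                           minLineLength=gray.shape[1] // 4, maxLineGap=10)
    if segs is None:
        return 0.0
    xs = segs[:, 0, [0, 2]].ravel()                # segment endpoint x values
    ys = segs[:, 0, [1, 3]].ravel()                # segment endpoint y values
    box = (xs.max() - xs.min()) * (ys.max() - ys.min())
    return box / (gray.shape[0] * gray.shape[1])

pages = convert_from_path("input.pdf", dpi=72)
skewed = sum(1 for p in pages if frame_ratio(p) < 0.5)   # 0.5 is arbitrary
print(f"{skewed} of {len(pages)} pages look like skewed scans")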