Using PIL to detect a scan of a blank page

So I often run huge double-sided scan jobs on an unintelligent Canon multifunction, which leaves me with a huge folder of JPEGs. Am I insane to consider using PIL to analyze a folder of images to detect scans of blank pages and flag them for deletion?

Leaving the folder-crawling and flagging parts out, I imagine this would look something like:

1. Check whether the image is greyscale, since that can't be assumed.
2. If so, detect the dominant range of shades (the background colour).
3. If not, detect the dominant range of shades, restricting to light greys.
4. Determine what percentage of the entire image is composed of said shades.
5. Try to find a threshold that adequately detects pages with type, writing or imagery.
6. Perhaps test fragments of the image at a time to increase the accuracy of the threshold.
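A minimal sketch of the steps above, not a tested solution. It assumes Python 3, 8-bit greyscale data and paper that is lighter than mid-grey. The names blank_score, is_blank and page_is_blank, and the values chosen for band, threshold, light_from and grid, are all hypothetical and would need tuning against real scans. The scoring works on a plain 256-bin histogram, which is what Pillow's Image.histogram() returns for a mode "L" image; Pillow is only imported inside page_is_blank.

```python
def blank_score(histogram, band=16, light_from=128):
    """Fraction of pixels within `band` grey levels of the dominant light shade.

    `histogram` is a 256-entry list of pixel counts (8-bit greyscale).
    """
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    # Dominant shade, searched among light greys only (assumed paper colour).
    peak = max(range(light_from, 256), key=lambda level: histogram[level])
    lo, hi = max(0, peak - band), min(255, peak + band)
    return sum(histogram[lo:hi + 1]) / total


def is_blank(histogram, band=16, threshold=0.99, light_from=128):
    """True when nearly all pixels sit close to the paper shade."""
    return blank_score(histogram, band, light_from) >= threshold


def page_is_blank(path, grid=4, band=16, threshold=0.99):
    """Check the page tile by tile; any tile with content marks it non-blank."""
    from PIL import Image  # assumes Pillow is installed

    img = Image.open(path).convert("L")  # force 8-bit greyscale
    width, height = img.size
    for row in range(grid):
        for col in range(grid):
            box = (col * width // grid, row * height // grid,
                   (col + 1) * width // grid, (row + 1) * height // grid)
            if box[2] <= box[0] or box[3] <= box[1]:
                continue  # skip empty tiles on very small images
            if not is_blank(img.crop(box).histogram(), band, threshold):
                return False
    return True
```

Testing tiles separately means a small block of text cannot be drowned out by the rest of a mostly white page, at the cost of being more sensitive to specks and scanner shadows near the edges.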

I know this is sort of an edge case, but can anyone with PIL experience lend some pointers?

asked Mar 24 '11 22:03

Christopher O'Donnell

1 Answer

Here is an alternative solution, using mahotas and milk.

1. Start by creating two directories, positives/ and negatives/, and manually sort a few example scans into them.
2. I will assume that the rest of the data is in an unlabeled/ directory.
3. Compute features for all of the images in positives/ and negatives/.
4. Learn a classifier.
5. Use that classifier on the unlabeled images.

In the code below I used jug so that you can run it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator.

from glob import glob
import mahotas
import mahotas.features
import milk
from jug import TaskGenerator


@TaskGenerator
def features_for(imname):
    # Haralick texture features, averaged over all directions
    img = mahotas.imread(imname)
    return mahotas.features.haralick(img).mean(0)

@TaskGenerator
def learn_model(features, labels):
    learner = milk.defaultclassifier()
    return learner.train(features, labels)

@TaskGenerator
def classify(model, features):
    return model.apply(features)

positives = glob('positives/*.jpg')
negatives = glob('negatives/*.jpg')
unlabeled = glob('unlabeled/*.jpg')


# Build a list rather than calling map(): on Python 3, map() returns a lazy iterator
features = [features_for(im) for im in negatives + positives]
# Label 0 for negatives, 1 for positives, in the same order as the features
labels = [0] * len(negatives) + [1] * len(positives)

model = learn_model(features, labels)

# One prediction per unlabeled image
labeled = [classify(model, features_for(u)) for u in unlabeled]

This uses texture features, which is probably good enough, but you can play with other features in mahotas.features if you'd like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do classification with the sort of hard thresholds you are looking for unless the scanning is very controlled.
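For the flagging step that the question leaves out, here is a rough sketch using only the standard library. It assumes the predictions are available as plain integers in the same order as the file list (for instance after removing the TaskGenerator lines) and that label 1 means "blank", which holds only if the blank examples were put in positives/. The function name flag_blank_pages and the flagged_blank directory are hypothetical. Files are moved to a review folder rather than deleted, so a misclassified page can still be recovered.

```python
import shutil
from pathlib import Path


def flag_blank_pages(paths, labels, blank_label=1, review_dir="flagged_blank"):
    """Move every file predicted as blank into `review_dir`; return the new paths."""
    if len(paths) != len(labels):
        raise ValueError("paths and labels must have the same length")
    target = Path(review_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for path, label in zip(paths, labels):
        if label == blank_label:
            destination = target / Path(path).name
            shutil.move(str(path), str(destination))
            moved.append(destination)
    return moved
```

With the variables from the answer's code, this would be called as flag_blank_pages(unlabeled, labeled).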

answered Oct 02 '22 12:10

luispedro

Tags: python, python-imaging-library, computer-vision, imaging, image-scanner
