I'm looking for a method of classifying scanned pages that consist largely of text.
Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.
I will answer in three parts, since your problem is clearly a large one. If the collection does not exceed about 1,000 pages, I would strongly recommend the manual route with cheap labour instead.
Part 1: Feature Extraction - You have a very large array of features to choose from in the object-detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners, etc., suitable. Deciding which features to use can require expert knowledge; if you have the computing power, I would recommend building a large pool of candidate features and passing them through an importance estimator driven by classifier training (a feature-extraction sketch follows below).
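As a rough sketch of the feature-extraction step, here is what SIFT extraction could look like through OpenCV's Python interface (assumes OpenCV >= 4.4, where SIFT is in the main package; "page.png" is a placeholder path):

```python
# Sketch: extracting SIFT descriptors from a scanned page with OpenCV.
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# descriptors is an (N, 128) array; one simple way to get a fixed-length
# page-level feature vector is to average the descriptors.
page_feature = descriptors.mean(axis=0) if descriptors is not None else np.zeros(128)
```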
Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is simple to grasp, and it is highly flexible and non-parametric. Tuning requires very few parameters, and you can also run it in a parameter-selection mode during supervised training.
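A minimal sketch of the classifier step, using scikit-learn's RandomForestClassifier in Python (the answer suggests R for this part; the data here is a synthetic placeholder standing in for your page feature vectors):

```python
# Sketch: training a Random Forest on page-level feature vectors.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: replace with your real feature matrix and "A"/"B" labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = rng.choice(["A", "B"], size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
# Per-feature importances can serve as the "importance estimator" mentioned in Part 1.
print(clf.feature_importances_)
```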
Part 3: Implementation - Python is essentially a glue language. Pure-Python implementations of image processing are never going to be very fast. I recommend using a combination of OpenCV for feature detection and R for the statistical work and classifiers.
The solution may seem over-engineered but machine learning has never been a simple task even when the difference between pages is simply that they are the left-hand and right-hand pages of a book.
First, I would like to say that, to my mind, OpenCV is a very good tool for this kind of manipulation. Moreover, it has a well-documented Python interface.
OpenCV is highly optimized and your problem is not an easy one.
[GLOBAL EDIT: reorganization of my ideas]
Here are a few ideas for features that could be used:
For detecting the barcodes, you could try a distance transform (DistTransform in OpenCV) if the barcodes are isolated. You may also be able to find interest points easily with match or matchShapes. I think this is feasible because the barcodes should all have the same shape (size, etc.). The score of the matched interest points could be used as a feature (see the sketch just below).
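A possible sketch of this idea in OpenCV's Python interface (modern name cv2.distanceTransform; "page.png" and "template_barcode.png" are placeholder paths, and the thresholds are illustrative):

```python
# Sketch: distance transform plus matchShapes to look for a barcode-like region.
import cv2

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template_barcode.png", cv2.IMREAD_GRAYSCALE)

# Binarize (barcodes are dark on a light background) and compute the distance transform.
_, page_bin = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
dist = cv2.distanceTransform(page_bin, cv2.DIST_L2, 5)
# Statistics of the distance transform (e.g. its maximum) can serve as simple features.
dist_feature = float(dist.max())

# Compare page contours against the template's outer contour (assumes the
# template actually yields at least one contour).
_, tmpl_bin = cv2.threshold(template, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
tmpl_contours, _ = cv2.findContours(tmpl_bin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
page_contours, _ = cv2.findContours(page_bin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# The best (lowest) shape-matching score can be used as another feature.
best_score = min(
    cv2.matchShapes(tmpl_contours[0], c, cv2.CONTOURS_MATCH_I1, 0.0)
    for c in page_contours
)
```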
The moments of the image could be useful here because you have different kinds of global structures. This may already be sufficient to distinguish A pages from B pages (see the OpenCV moments functions), and you get invariant descriptors as a bonus :) (a sketch follows below).
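For example, the Hu moments give rotation- and scale-invariant global descriptors; a minimal sketch ("page.png" is again a placeholder path):

```python
# Sketch: Hu moments as invariant global descriptors of a page.
import cv2
import numpy as np

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

moments = cv2.moments(binary)
hu = cv2.HuMoments(moments).flatten()

# Log-scaling is a common trick to bring the seven Hu moments onto comparable scales.
hu_feature = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
```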
You could also compute the vertical gradient and the horizontal gradient. A barcode is a specific zone where the vertical gradient == 0 and the horizontal gradient != 0. The main advantage is the low cost of these operations, since your goal is only to check whether such a zone exists on the page. You can find the interest zone and use its score as a feature (see the sketch below).
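One cheap way to turn this gradient test into a scalar feature, sketched with Sobel derivatives in OpenCV (window size and file name are illustrative assumptions):

```python
# Sketch: gradient-based "barcode-ness" score (strong horizontal gradient,
# weak vertical gradient over a local window).
import cv2
import numpy as np

page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Horizontal and vertical derivatives.
grad_x = cv2.Sobel(page, cv2.CV_32F, 1, 0, ksize=3)
grad_y = cv2.Sobel(page, cv2.CV_32F, 0, 1, ksize=3)

# Barcode bars give |grad_x| >> |grad_y|; smooth the difference over a window.
diff = cv2.blur(np.abs(grad_x) - np.abs(grad_y), (21, 21))

# The maximum of the smoothed map is a single scalar feature for the classifier.
barcode_score = float(diff.max())
```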
Once you have your features, you can try supervised learning and test how well it generalizes. Your problem requires very few false negatives (because you are going to throw away some pages), so you should evaluate your performance with ROC curves and look carefully at the sensitivity (which should be high). A sketch follows below.
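A sketch of that evaluation with scikit-learn, continuing from the Random Forest sketch above (it reuses clf, X_test and y_test from there; the 0.99 sensitivity target is an illustrative assumption):

```python
# Sketch: ROC curve and sensitivity (true-positive rate) for the page classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Probability scores for the positive class "B" (classes are sorted, so column 1 is "B").
scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label="B")
print("AUC:", roc_auc_score(y_test == "B", scores))

# Pick the first threshold that keeps sensitivity (TPR) above, say, 0.99,
# so that very few pages you actually want are thrown away.
idx = int(np.argmax(tpr >= 0.99))
print("threshold:", thresholds[idx], "TPR:", tpr[idx], "FPR:", fpr[idx])
```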
For the classification, you could also use regression with a lasso penalty (L1 regularization) to select the best features (sketched below).
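A minimal sketch of lasso-style feature selection via L1-penalized logistic regression in scikit-learn (X and y stand for the same feature matrix and labels as above; the regularization strength C=0.1 is an illustrative assumption):

```python
# Sketch: L1-penalized logistic regression keeps only the most informative features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso_clf.fit(X, y)

# Features with non-zero coefficients are the ones the lasso penalty kept.
coefs = lasso_clf.named_steps["logisticregression"].coef_.ravel()
selected = np.flatnonzero(coefs)
print("selected feature indices:", selected)
```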
The post by whatnick also gives good ideas and other (perhaps more general) descriptors.