 

Image classification in Python

I'm looking for a method of classifying scanned pages that consist largely of text.

Here are the particulars of my problem. I have a large collection of scanned documents and need to detect the presence of certain kinds of pages within these documents. I plan to "burst" the documents into their component pages (each of which is an individual image) and classify each of these images as either "A" or "B". But I can't figure out the best way to do this.

More details:

  • I have numerous examples of "A" and "B" images (pages), so I can do supervised learning.
  • It's unclear to me how best to extract features from these images for training; e.g., what should those features be?
  • The pages are occasionally rotated slightly, so it would be great if the classification were somewhat insensitive to rotation and (to a lesser extent) scaling.
  • I'd like a cross-platform solution, ideally in pure python or using common libraries.
  • I've thought about using OpenCV, but this seems like a "heavyweight" solution.

EDIT:

  • The "A" and "B" pages differ in that the "B" pages have forms on them with the same general structure, including the presence of a bar code. The "A" pages are free text.
asked Oct 11 '10 by Kyle

2 Answers

I will answer in three parts, since your problem is clearly a large one. First, though: if the collection does not exceed about 1,000 pages, I would highly recommend a manual approach with cheap labour.

Part 1: Feature Extraction - You have a very large array of features to choose from in the object-detection field. Since one of your requirements is rotation invariance, I would recommend the SIFT/SURF class of features. You might also find Harris corners etc. suitable. Deciding which features to use can require expert knowledge; if you have the computing power, I would recommend creating a nice melting pot of features and pruning it with an importance estimator during classifier training.
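To make this concrete, here is a minimal sketch of pulling SIFT descriptors out of one page with OpenCV's Python bindings. It assumes OpenCV >= 4.4 (where SIFT lives in the main module), and "page.png" is a placeholder file name:

```python
import cv2

# Minimal sketch: rotation-invariant SIFT descriptors for one page.
# Assumes OpenCV >= 4.4 and a placeholder input file "page.png".
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

# descriptors is an (n_keypoints, 128) array; pooling it into one
# fixed-length vector per page (e.g. a bag of visual words) gives you
# something a classifier can consume.
print(len(keypoints), descriptors.shape)
```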

Part 2: Classifier Selection - I am a great fan of the Random Forest classifier. The concept is very simple to grasp, and it is highly flexible and non-parametric. Tuning requires very few parameters, and you can also run it in a parameter-selection mode during supervised training.
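A hedged sketch of the Random Forest idea with scikit-learn, using stand-in random data where your pooled page features and A/B labels would go:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data; in practice X holds one pooled feature vector per page
# and y holds the "A"/"B" labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.choice(["A", "B"], size=200)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

clf.fit(X, y)
# feature_importances_ ranks the features, which gives you the
# importance-based pruning suggested in Part 1.
print(clf.feature_importances_[:5])
```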

Part 3: Implementation - Python is, in essence, a glue language. Pure-Python image-processing implementations are never going to be very fast. I recommend combining OpenCV for feature detection with R for the statistical work and classifiers.
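If you do split the work between Python and R, the glue can be as simple as writing the feature matrix to CSV for R to read. A sketch of the Python side, with stand-in data and illustrative file names:

```python
import numpy as np

# Dump per-page features and labels to CSV so the classifier work can be
# done in R (e.g. read.csv plus the randomForest package).
X = np.random.default_rng(0).normal(size=(200, 64))  # one row per page
labels = ["A" if i % 2 else "B" for i in range(200)]

np.savetxt("features.csv", X, delimiter=",")
with open("labels.csv", "w") as f:
    f.write("\n".join(labels) + "\n")
```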

The solution may seem over-engineered, but machine learning has never been a simple task, even when the difference between pages is simply that they are the left-hand and right-hand pages of a book.

answered Oct 11 '22 by whatnick


First, I would like to say that, to my mind, OpenCV is a very good tool for this kind of manipulation. Moreover, it has a Python interface, well described here.

OpenCV is highly optimized and your problem is not an easy one.

[GLOBAL EDIT: reorganization of my ideas]

Here are a few ideas for features that could be used:

  • For detecting the barcodes, you could try a distance transform (DistTransform in OpenCV) if the barcodes are isolated. You may then be able to find interest points easily with match or matchShapes. I think this is feasible because the barcodes should have the same shape (size, etc.). The score of the interest points could be used as a feature.

  • The moments of the image could be useful here because the two page types have different global structures. This may already be sufficient to distinguish A pages from B pages (see there for the OpenCV function), and you get invariant descriptors in the process :) (see the sketch after this list).

  • You could also try computing the vertical and horizontal gradients. A barcode is a specific zone where the vertical gradient == 0 and the horizontal gradient != 0. The main advantage is the low cost of these operations, since your goal is only to check whether such a zone exists on the page. You can find the interest zone and use its score as a feature (the sketch after this list includes this gradient test).
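Here is a rough sketch of the moments and gradient ideas above, written against the modern cv2 API (this answer predates it); "page.png" is a placeholder:

```python
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Global structure: Hu moments are invariant to rotation, scale and
# translation, matching the requirements in the question.
hu = cv2.HuMoments(cv2.moments(img)).flatten()  # 7 values

# Barcode-like zone: the bars produce a strong horizontal gradient and
# almost no vertical gradient.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
zone_score = cv2.boxFilter(np.abs(gx) - np.abs(gy), -1, (21, 21))

# One scalar per page: how barcode-like is the most barcode-like window?
features = np.append(hu, zone_score.max())
```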

Once you have your features, you can try supervised learning and test how well it generalizes. Your problem requires very few false negatives (because you are going to throw away some pages), so you should evaluate your performance with ROC curves and look carefully at the sensitivity (which should be high). For the classification, you could use regression with a lasso penalty to find the best features. whatnick's post also gives good ideas and other (perhaps more general) descriptors.
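For instance, a lasso-penalized logistic regression plus a ROC curve could look like this in scikit-learn (an assumed library choice; the stand-in data takes the place of real page features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X/y come from the feature extraction above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 2, size=300)  # 1 = "B" (form/barcode page)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An L1 (lasso) penalty drives the weights of weak features to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)
print("AUC:", roc_auc_score(y_te, scores))
# Pick an operating threshold with high sensitivity (tpr) so that few
# target pages are missed, i.e. few false negatives.
```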

answered Oct 11 '22 by ThR37