How to recognize Text-Presence pattern in a scanned image and crop it?

Question

Smart Cropping for Scanned Docs

Recently I took over a preservation project of old books/manuscripts. They are huge in quantity, almost 10,000 pages. I had to scan them manually with a portable scanner as they were not in a condition to be scanned in an automated book scanner.

The real problem shows up when I start editing them in Photoshop. All of them are plain documents (in JPG format) and contain absolutely no pictures. They are in a language (Oriya) for which I am sure no OCR software will be available in the near future. (If there is any, please let me know.)

To make those document images look clean and elegant I have to crop them, position them, increase the contrast a bit, clean up stray spots with the eraser, et cetera. I was able to automate most of these steps in Photoshop, but cropping is where I am stuck. I can't automate cropping because the software can't recognize the presence of text or content in a given area of the image; it just applies whatever fixed crop values it is given.

I want to automate this cropping process. I have an idea for how to do it, but I don't know whether it is practical to implement, and as far as I know there is no software on the market that does this kind of thing.

The possible solution: a tool could recognize the presence of text in an image (the recognition does not need to be sophisticated, since all of them are normal document images: no pictures, no patterns, just plain rectangles of text) and crop right at the border of that text on each side, so that it outputs a document image without any margin. After this, the rest of the tasks can be automated in Photoshop, such as adding white space for margins and tweaking the contrast and color to make the page more readable.

Here is an album link to the gallery. I can post more sample images if it would be useful - just let me know.

http://imageshack.us/g/1/9800204/

Here is one example from the bigger set of sample images available through the link above:

one example of a bigger set...

Kurt Pfeifle · Accepted Answer

Using the sample from tinypic, original scan

with ImageMagick I'd construct an algorithm along the following lines:

Contrast-stretch the original image

Values of 1% for the black point and 10% for the white point seem about right.

Command:

```
convert                               \
   http://i46.tinypic.com/21lppac.jpg \
  -contrast-stretch 1%x10%            \
   contrast-stretched.jpg
```

Result: contrast-stretched result

Shave off some border pixels to get rid of the dark scanning artefacts there

A value of 30 pixels on each edge seems about right.

Command:
```
convert                   \
   contrast-stretched.jpg \
  -shave 30x30            \
   shaved.jpg   
```
Result:

De-speckle the image

This operator takes no parameters. Repeating it three times gives better results.

Command:

```
convert       \
   shaved.jpg \
  -despeckle  \
  -despeckle  \
  -despeckle  \
   despeckled.jpg
```

Result: despeckled image

Apply a threshold to make all pixels either black or white

A value of roughly 50% seems about right.

Command:
```
convert           \
   despeckled.jpg \
  -threshold 50%  \
   b+w.jpg
```
Result:

Re-add the shaved-off pixels

Running `identify -format '%Wx%H' 21lppac.jpg` shows that the original image measures 1536x835 pixels.

Command:
```
convert            \
   b+w.jpg         \
  -gravity center  \
  -extent 1536x835 \
   big-b+w.jpg
```
Result: (Note: this step is optional. Its purpose is to get back to the original image dimensions, which you may want if you go on from here to overlay the result on the original, or something similar.)

De-Skew the image

A threshold of 40% (the value the ImageMagick documentation suggests for most images) seems to work here too.

Command:
```
convert        \
   big-b+w.jpg \
  -deskew 40%  \
   deskewed.jpg
```
Result:

Remove from each edge all rows and columns of pixels which are purely white

This can be achieved by simply using the -trim operator.

Command:
```
convert         \
   deskewed.jpg \
  -trim         \
   trimmmed.jpg
```
Result:

As you can see, the result is not yet perfect:

- there remain some random artefacts on the bottom edge of the image;
- the final trimming didn't remove all white space from the edges, because of other minimal artefacts; and
- I didn't (yet) attempt to apply a distortion correction to the image in order to fix (some of) the distortion. (You can get an idea about what it could achieve by looking at this answer to "Understanding Perspective Projection Distortion ImageMagick".)
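
For the second point above (trimming stopped early by stray specks), one possible approach is to compute the crop box on a blurred copy of the image, where isolated specks fade into the white background while dense text stays dark, and then apply that box to the unblurred image. This is only a sketch and not part of the steps above: the function name `crop_to_text` is hypothetical, and the blur (`0x5`) and threshold (`85%`) values are untested guesses that would need tuning on real scans. It relies on the `%@` format escape, which reports the trim bounding box without actually trimming.

```shell
#!/bin/sh
# Sketch: noise-tolerant trim. crop_to_text is a hypothetical helper name;
# the blur and threshold values are assumptions to be tuned per scan set.
crop_to_text() {
    src=$1
    dst=$2
    # Blur so that tiny specks wash out, binarize again, then ask ImageMagick
    # for the trim bounding box (WxH+X+Y) without modifying any file.
    box=$(convert "$src" -blur 0x5 -threshold 85% -format '%@' info:) || return 1
    [ -n "$box" ] || return 1
    # Crop the original (unblurred) image to that box and drop the canvas offset.
    convert "$src" -crop "$box" +repage "$dst"
}

# Usage (file names as in the steps above):
#   crop_to_text deskewed.jpg cropped.jpg
```

Because the box is measured on the blurred copy, it tends to be slightly larger than the text itself, which leaves a thin safety margin around the content.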

You can probably get better results by adjusting a few of the parameters used in each step.

You can also automate the whole process by putting the commands into a shell or batch script.
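
As a rough sketch of such a script, the steps above could be chained into a single `convert` call per page. The parameter values are the ones used in this answer; the function name `clean_scan` and the directory names in the usage comment are hypothetical. The sketch assumes ImageMagick's `convert` and `identify` are on the PATH. Since it skips the intermediate JPEG files, the output may differ slightly from running the commands one by one.

```shell
#!/bin/sh
# Sketch: all steps from this answer in one pass (clean_scan is a made-up name).
clean_scan() {
    src=$1
    dst=$2
    # Read the original size so the shaved border can be restored later,
    # instead of hard-coding 1536x835.
    size=$(identify -format '%wx%h' "$src") || return 1
    convert "$src" \
        -contrast-stretch 1%x10% \
        -shave 30x30 +repage \
        -despeckle -despeckle -despeckle \
        -threshold 50% \
        -gravity center -extent "$size" \
        -deskew 40% \
        -trim +repage \
        "$dst"
}

# Usage (assumed layout: scans in ./scans, results in ./cropped):
#   mkdir -p cropped
#   for f in scans/*.jpg; do
#       clean_scan "$f" "cropped/$(basename "$f")"
#   done
```

The `+repage` after `-shave` and `-trim` discards the virtual-canvas offsets those operators leave behind, which the step-by-step version loses implicitly each time it writes a JPEG.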

Update

OK, so here is a perspective distortion that roughly rectifies the deformation.

Command:
```
convert                                                                         \
   trimmmed.jpg                                                                 \
  -distort perspective '0,0 0,0  1300,0 1300,0  0,720 0,720  1300,720 1300,770' \
   distort.jpg
```

Result (once more with the original underneath, to make direct visual comparison easier): un-distorted image, original image
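
The argument to `-distort perspective` in the command above is a list of control-point pairs, each written as "source x,y destination x,y". Three corners are mapped onto themselves and only the bottom-right corner is pulled down, from 1300,720 to 1300,770. A small helper can make that explicit when trying other values; this is just an illustration, and the name `perspective_points` and its three parameters are hypothetical.

```shell
#!/bin/sh
# Sketch: build the control-point list for -distort perspective.
# Each group is "src_x,src_y dst_x,dst_y"; only the bottom-right corner
# moves, by dy pixels downwards. w, h and dy are estimated by eye.
perspective_points() {
    w=$1
    h=$2
    dy=$3
    printf '0,0 0,0  %d,0 %d,0  0,%d 0,%d  %d,%d %d,%d\n' \
        "$w" "$w" "$h" "$h" "$w" "$h" "$w" "$((h + dy))"
}

# Usage, reproducing the values from the command above:
#   convert trimmmed.jpg \
#       -distort perspective "$(perspective_points 1300 720 50)" \
#       distort.jpg
```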

There is still some barrel-like distortion left in the image, which can probably be removed with a -distort BarrelInverse correction; we'd just need to find fitting parameters.

Tags: image-processing, imagemagick, photoshop, photoshop-cs4
