Programmatically divide scanned images into separate images

To improve OCR quality, I need to preprocess my scanned images. Sometimes a single scan contains several components at different angles (for example, a few paper documents scanned at one time), such as:

enter image description here

Is it possible to programmatically split such an image into separate images, one per logical document, for example with a tool like ImageMagick? Are there any existing solutions or techniques for this problem?

alexanoid asked Feb 01 '18 07:02

1 Answer

alexanoid wrote: I have added another image with scanning artifacts. Will this approach work on such images also?

No, it will not work well, for several reasons. The second image you provided is much larger than the first, so it would need a much larger blur. It is a JPG and contains compression artifacts; JPG is not a good format here, since 'constant' regions of the image are not really constant. The blur will pick up those artifacts, and a different threshold will be needed to remove some of them. In your case, the top of the image has a good-sized artifact that will get caught as an object. Finally, the bounding boxes of your blurred and thresholded text regions overlap even though the regions themselves do not touch, so one crop may include text from other regions.

Here is my test command to blur and threshold your image:

convert image.jpg -blur 0x50 -auto-level -threshold 95% -type bilevel tmp.png

enter image description here
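
To illustrate the last point (bounding boxes that overlap even though the regions do not touch), here is a minimal sketch in plain Python. It is not part of the ImageMagick workflow above: the tiny grid stands in for a blurred and thresholded image, and the names `GRID`, `label_components` and `boxes_overlap` are hypothetical helpers made up for this example.

```python
from collections import deque

# Toy stand-in for a thresholded image: '#' marks pixels that belong to a region.
# The two regions never touch, not even diagonally.
GRID = [
    "#####.",
    "#.....",
    "#..###",
    "#....#",
]
mask = [[ch == "#" for ch in row] for row in GRID]


def label_components(mask):
    """Return one (top, left, bottom, right) box per 8-connected region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting at an unvisited region pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def boxes_overlap(a, b):
    """True if two inclusive (top, left, bottom, right) boxes share any pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


boxes = label_components(mask)
print(boxes)                             # two separate regions are found
print(boxes_overlap(boxes[0], boxes[1])) # True: their boxes still overlap
```

Cropping the first box from this grid would also cut out part of the second region, which is the same effect as one crop picking up text from a neighbouring document when the documents are rotated or interleaved.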

fmw42 answered Nov 18 '22 17:11
