
How can I detect edges on an image of a document, and cut sections into separate images?

The task is to take an image of a document and use the straight lines surrounding its different 'sections' to split the image into separate documents for further parsing. The size of the 'sections' varies completely from page to page (we're dealing with several thousand pages). Here is an example of how the documents are laid out:

[Image: Example Doc]

Image analysis/manipulation is completely new to me. So far I've attempted to use Scikit-image edge detection algorithms to find the 'boxes', with hopes to use those 'coordinates' to cut the image. However, the two algorithms I've tried (Canny, Hough) are picking up lines of text as 'edges' on high sensitivity, and not picking up the lines that I want on low sensitivity. I could write something custom and low level to detect the boxes myself, but I have to assume this is a solved problem.

Is my approach headed in the right direction? Thank you!

migsvult asked Mar 06 '17 05:03


1 Answer

You don't seem to be getting any OpenCV answers, so I had a try with ImageMagick, just in the Terminal at the command-line. ImageMagick is installed on most Linux distros and is available for macOS and Windows for free. The technique is pretty readily adaptable to OpenCV so you can port it across if it works for you.

My first step was to apply a 5x5 box filter and threshold at 80% to get rid of noise and scanning artefacts, and then invert (probably because I was planning on using morphology, but didn't in the end).

convert news.jpg -depth 16 -statistic mean 5x5 -threshold 80% -negate z.png

[Image: z.png, the filtered, thresholded and inverted page]
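If you want to port that first step to Python, here is a rough, dependency-free sketch of what the box filter, threshold and invert do to a greyscale image. The function name `mean_threshold_negate`, the list-of-rows image format and the edge clamping are assumptions of this sketch, not something taken from ImageMagick; in practice you would do this with OpenCV or NumPy.

```python
def mean_threshold_negate(gray, size=5, threshold=0.8, white=255):
    """Box-filter a greyscale image, threshold it, then invert it.

    gray is a list of rows of ints in 0..white. Pixels outside the image
    are taken from the nearest edge pixel (an assumption of this sketch).
    """
    height, width = len(gray), len(gray[0])
    half = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy = min(max(y + dy, 0), height - 1)  # clamp to the edge
                    xx = min(max(x + dx, 0), width - 1)
                    total += gray[yy][xx]
            mean = total / (size * size)
            # Threshold: brighter than 80% would become white, the rest black.
            # Negate then flips that, so paper ends up black and ink white.
            row.append(0 if mean > threshold * white else white)
        result.append(row)
    return result
```

A single dark speck on white paper averages out to well above 80% brightness inside a 5x5 window, so it disappears, while a line a couple of pixels thick survives as white in the result.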

I then ran that through "Connected Components Analysis" and discarded all blobs with too small an area (under 2000 pixels):

convert news.jpg -depth 16 -statistic mean 5x5 -threshold 80% -negate  \
   -define connected-components:verbose=true                           \
   -define connected-components:area-threshold=2000                    \
   -connected-components 4 -auto-level output.png

Output

Objects (id: bounding-box centroid area mean-color):
  110: 1254x723+59+174 686.3,536.0 901824 srgb(0,0,0)
  2328: 935x723+59+910 526.0,1271.0 676005 srgb(0,0,0)
  0: 1370x1692+0+0 685.2,712.7 399651 srgb(0,0,0)
  2329: 303x722+1007+911 1158.0,1271.5 218766 srgb(0,0,0)
  25: 1262x40+54+121 685.2,140.5 49820 srgb(255,255,255)
  109: 1265x735+54+168 708.3,535.0 20601 srgb(255,255,255)
  1: 1274x64+48+48 675.9,54.5 16825 srgb(255,255,255)
  2326: 945x733+54+905 526.0,1271.0 16660 srgb(255,255,255)  
  2327: 312x732+1003+906 1169.9,1271.5 9606 srgb(255,255,255)  <--- THIS ONE
  421: 403x15+328+342 528.6,350.1 4816 srgb(255,255,255)
  7: 141x23+614+74 685.5,85.2 2831 srgb(255,255,255)

The fields are labelled in the first line, but the interesting ones are the second (bounding-box geometry) and the fourth (blob area). As you can see, there are 11 lines, so it has found 11 blobs in the image. The second field, AxB+C+D, means a rectangle A pixels wide by B pixels tall with its top-left corner C pixels from the left edge of the image and D pixels down from the top.
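If you are driving this from a script, the listing is easy to pull apart. Below is a minimal sketch that parses lines in the layout shown above into records. The `Blob` type, the `parse_blobs` name and the regular expression are my own illustration and assume the output looks exactly like this listing; other ImageMagick versions may format it differently.

```python
import re
from typing import NamedTuple


class Blob(NamedTuple):
    blob_id: int
    width: int
    height: int
    x: int
    y: int
    area: int
    color: str


# id: WxH+X+Y  centroid  area  mean-colour
LINE = re.compile(
    r"^\s*(\d+):\s+(\d+)x(\d+)\+(\d+)\+(\d+)\s+"  # id and bounding box
    r"[\d.]+,[\d.]+\s+"                           # centroid (ignored here)
    r"(\d+)\s+(\S+)"                              # area and mean colour
)


def parse_blobs(text):
    """Return one Blob per matching line; the header line is skipped."""
    blobs = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            *numbers, color = match.groups()
            blobs.append(Blob(*map(int, numbers), color))
    return blobs


sample = """Objects (id: bounding-box centroid area mean-color):
  110: 1254x723+59+174 686.3,536.0 901824 srgb(0,0,0)
  2327: 312x732+1003+906 1169.9,1271.5 9606 srgb(255,255,255)  <--- THIS ONE
  7: 141x23+614+74 685.5,85.2 2831 srgb(255,255,255)
"""
blobs = parse_blobs(sample)
```

Anything trailing the colour field, such as a hand-written marker, is ignored because the pattern is not anchored at the end of the line.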

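Let's look at the one I have marked with an arrow, which starts 2327: 312x732+1003+906, and draw a rectangle over it. Its top-left corner is at (1003,906), so its bottom-right corner is at (1003+312,906+732) = (1315,1638):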

convert news.jpg -fill "rgba(255,0,0,0.5)" -draw "rectangle 1003,906 1315,1638" oneArticle.png

[Image: oneArticle.png, the selected article highlighted in semi-transparent red]
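The corner coordinates in that `-draw` argument come straight from the geometry: the bottom-right corner is the top-left corner plus the width and height. A small sketch of that arithmetic, with hypothetical helper names and assuming non-negative offsets as in the listing above:

```python
def geometry_to_corners(geometry):
    """Convert 'WxH+X+Y' into (left, top, right, bottom)."""
    size, x, y = geometry.split("+")
    width, height = size.split("x")
    left, top = int(x), int(y)
    return left, top, left + int(width), top + int(height)


def draw_argument(geometry):
    """Build the string passed to -draw for one bounding box."""
    left, top, right, bottom = geometry_to_corners(geometry)
    return f"rectangle {left},{top} {right},{bottom}"
```

For the blob above, `draw_argument("312x732+1003+906")` gives `rectangle 1003,906 1315,1638`, which matches the command I typed by hand.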

If you want to crop that article out into a new image:

convert news.jpg -crop 312x732+1003+906 article.jpg

[Image: article.jpg, the cropped article]
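With several thousand pages you will want to generate those crop commands rather than type them. The sketch below builds one `convert ... -crop` argument list per blob that looks like an article frame. Note that the selection rule (white blobs at least `min_side` pixels in both directions) and the `article_<id>.jpg` naming are guesses of mine that you would need to tune on your own scans.

```python
def crop_commands(blobs, source="news.jpg", min_side=100):
    """Build crop commands for blobs given as (blob_id, geometry, color)."""
    commands = []
    for blob_id, geometry, color in blobs:
        size = geometry.split("+")[0]
        width, height = (int(value) for value in size.split("x"))
        # Hypothetical heuristic: frames are white blobs that are big both ways.
        if color == "srgb(255,255,255)" and min(width, height) >= min_side:
            commands.append(
                ["convert", source, "-crop", geometry, f"article_{blob_id}.jpg"]
            )
    return commands


example = [
    (25, "1262x40+54+121", "srgb(255,255,255)"),    # thin rule, skipped
    (110, "1254x723+59+174", "srgb(0,0,0)"),        # black region, skipped
    (2327, "312x732+1003+906", "srgb(255,255,255)"),
]
commands = crop_commands(example)
```

Each list can be handed to `subprocess.run` as is, which avoids any shell quoting problems.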

If we draw in all the other boxes, we get:

[Image: the page with all the other boxes drawn in]
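If you port this away from ImageMagick, the connected-components step is the part doing the real work, so here is a plain-Python sketch of it: label 4-connected regions of equal pixel value and report each one's bounding box and area. It is a simplification under my own assumptions. In particular it simply drops regions below `min_area`, which may not be exactly how ImageMagick's area threshold treats them, and a real port would more likely use OpenCV's `connectedComponentsWithStats`.

```python
from collections import deque


def connected_components(binary, min_area=1):
    """Label 4-connected regions of equal value in a list-of-rows image.

    Returns (x, y, width, height, area, value) tuples in scan order.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for start_y in range(height):
        for start_x in range(width):
            if seen[start_y][start_x]:
                continue
            value = binary[start_y][start_x]
            seen[start_y][start_x] = True
            queue = deque([(start_x, start_y)])
            min_x = max_x = start_x
            min_y = max_y = start_y
            area = 0
            while queue:  # breadth-first flood fill of one region
                x, y = queue.popleft()
                area += 1
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and binary[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:
                blobs.append((min_x, min_y, max_x - min_x + 1,
                              max_y - min_y + 1, area, value))
    return blobs
```

The useful property is the same one exploited above: a thin white frame has a small area but a bounding box that covers the whole article it surrounds.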

Mark Setchell answered Nov 08 '22 10:11
