
Article extraction from a newspaper image in Python and OpenCV

Tags:

python

opencv

First image: the result of applying the run-length smoothing algorithm (RLSA) horizontally and vertically, with pixel thresholds that depend on the dimensions of the image.

Second image: another attempt at extracting articles with larger pixel thresholds; here the articles merge with one another.

I tried extracting articles from a newspaper image using RLSA applied horizontally and vertically. With the pixel thresholds used for the first image, the headings are separated from their articles. With larger thresholds, neighboring articles merge, as shown in the second image. Can anyone suggest the best method to separate articles from the image in Python and OpenCV?

This loop applies the horizontal run-length smoothing algorithm to the image:

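    # Horizontal RLSA: in each row, blacken the run between two black
    # pixels that are at most 10 columns apart.
    for i in range(a):
        c = 0  # column of the last black pixel seen (0 until one is found)
        for j in range(b):
            if im_bw[i, j] == 0:
                if (j - c) <= 10:
                    im_bw[i, c:j] = 0
                c = j
        # Fill a short white run between the last black pixel and the right edge.
        if (b - c) <= 10:
            im_bw[i, c:b] = 0
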

This loop applies the vertical run-length smoothing algorithm to the image:

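    # Vertical RLSA: in each column, blacken the run between two black
    # pixels that are at most 9 rows apart.
    for i in range(b):
        c = 0  # row of the last black pixel seen (0 until one is found)
        for j in range(a):
            if im_bw[j, i] == 0:
                if (j - c) <= 9:
                    im_bw[c:j, i] = 0
                c = j
        # Fill a short white run between the last black pixel and the bottom edge.
        if (a - c) <= 9:
            im_bw[c:a, i] = 0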

Here a is the number of rows and b is the number of columns of the binary image im_bw.
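
The same gap-filling idea can also be written as a small NumPy helper. This is only a sketch of RLSA as it is usually described, not the exact loops above: it assumes a binary image with black pixels equal to 0 and white pixels equal to 255, fills only white runs that lie between two black pixels, and leaves runs touching the image border alone. The names rlsa_horizontal, rlsa_vertical and max_gap are made up for illustration.

```python
import numpy as np


def rlsa_horizontal(binary, max_gap):
    """Fill white gaps of at most max_gap pixels between black pixels in each row."""
    out = binary.copy()
    for row in out:  # each row is a view into out, so edits land in out
        black = np.flatnonzero(row == 0)  # column indices of black pixels
        if black.size < 2:
            continue
        gaps = np.diff(black) - 1  # white pixels between neighboring black pixels
        for start, gap in zip(black[:-1], gaps):
            if 0 < gap <= max_gap:
                row[start + 1:start + 1 + gap] = 0
    return out


def rlsa_vertical(binary, max_gap):
    """Same smoothing along columns, done by transposing the image."""
    return rlsa_horizontal(binary.T, max_gap).T
```

Applying one after the other, for example rlsa_vertical(rlsa_horizontal(im_bw, 10), 9), mirrors the order of the two loops in the question, and because the input is copied the original image stays available for comparing different gap values.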

The image shows how the algorithm worked on the binary image; the red mark shows where the articles merge.

vasista asked Oct 31 '22 16:10


1 Answer

I have an approach that worked for most of the images.

  1. Convert the color or grayscale image to binary using PIL or OpenCV.
  2. Remove pictures by treating them as contours: drop the contours whose area is much larger than the average area of all contours in the image.
  3. Remove lines using the Canny edge filter and Hough lines.
  4. Apply RLSA (run-length smoothing algorithm) to the resulting binary image. A description and code for RLSA can be found in this repository: https://github.com/Vasistareddy/python-rlsa

Removing lines helps because some e-papers use lines as article separators. More processing of the images can give better results: heuristics such as average width, average height, and average area can be applied to the contours that remain after the steps above.

Coming to the above question: articles always have a white background. Regions without a white background are clearly ads, pictures, or miscellaneous content. Removing pictures as described in the four steps above solves this issue.

PS: Choosing the horizontal and vertical RLSA values is always a mystery, since the gap between articles varies from edition to edition.
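
As one possible reading of the average-area heuristic, the sketch below keeps only bounding boxes that are not tiny compared to the mean box area. The function name filter_boxes_by_mean_area and the min_ratio of 0.5 are assumptions for illustration; the (x, y, w, h) tuples are in the form returned by cv2.boundingRect.

```python
def filter_boxes_by_mean_area(boxes, min_ratio=0.5):
    """Keep (x, y, w, h) boxes whose area is at least min_ratio times the mean area."""
    if not boxes:
        return []
    areas = [w * h for (_, _, w, h) in boxes]
    mean_area = sum(areas) / len(areas)
    # min_ratio is an assumed threshold; small specks fall below it and are dropped.
    return [box for box, area in zip(boxes, areas) if area >= min_ratio * mean_area]
```

The same pattern works for average width or average height by swapping the quantity that is compared against its mean.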

Edit:

The above problem is basically a matter of applying heuristics. Read through this:

https://medium.com/@vasista/extract-title-from-the-image-documents-in-python-application-of-rlsa-58f91237901f

vasista answered Nov 08 '22 17:11