How to detect paragraphs in a text document image for a non-consistent text structure in Python OpenCV

I am trying to identify paragraphs of text in a .pdf document by first converting it into an image and then using OpenCV. However, I am getting bounding boxes around individual lines of text instead of whole paragraphs. How can I set a threshold or some other limit so that I get paragraphs instead of lines?

Here is the sample input image:

input

Here is the output I am getting for the above sample:

output

I want a single bounding box around the paragraph in the middle. This is the code I am using:

import cv2
import numpy as np

large = cv2.imread('sample image.png')
rgb = cv2.pyrDown(large)
small = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)

# kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
kernel = np.ones((5, 5), np.uint8)
grad = cv2.morphologyEx(small, cv2.MORPH_GRADIENT, kernel)

_, bw = cv2.threshold(grad, 0.0, 255.0, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (9, 1))
connected = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)

# Using RETR_EXTERNAL instead of RETR_CCOMP
contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns two values in OpenCV 2.x and 4.x. On OpenCV 3.x it returns three,
# so comment out the previous line and uncomment the following one:
# _, contours, hierarchy = cv2.findContours(connected.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

mask = np.zeros(bw.shape, dtype=np.uint8)

for idx in range(len(contours)):
    x, y, w, h = cv2.boundingRect(contours[idx])
    mask[y:y+h, x:x+w] = 0
    cv2.drawContours(mask, contours, idx, (255, 255, 255), -1)
    r = float(cv2.countNonZero(mask[y:y+h, x:x+w])) / (w * h)

    if r > 0.45 and w > 8 and h > 8:
        cv2.rectangle(rgb, (x, y), (x+w-1, y+h-1), (0, 255, 0), 2)


cv2.imshow('rects', rgb)
cv2.waitKey(0)

asked Jul 29 '19 07:07

Achal Gambhir

1 Answer

This is a classic use case for dilation. Whenever you want to connect multiple items, you can dilate them so that adjacent contours join into a single contour. Here's a simple approach:

1. Obtain a binary image. Load the image, convert it to grayscale, apply a Gaussian blur, then apply Otsu's threshold to obtain a binary image.
2. Connect adjacent words. Create a rectangular kernel and dilate so that individual contours merge together.
3. Detect paragraphs. Find the contours, obtain the bounding rectangle of each one, and draw it on the image.

Otsu's threshold to obtain a binary image

[image: binary image after Otsu's threshold]

Here's where the magic happens. We can assume that a paragraph is a group of words that are close together, so we dilate to connect adjacent words:

[image: dilated image with adjacent words connected]

Result

[image: result with a bounding box around each paragraph]

import cv2
import numpy as np

# Load image, grayscale, Gaussian blur, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (7,7), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Create rectangular structuring element and dilate
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
dilate = cv2.dilate(thresh, kernel, iterations=4)

# Find contours and draw bounding rectangles
cnts = cv2.findContours(dilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns 2 values in OpenCV 2.x/4.x and 3 values in OpenCV 3.x
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    x,y,w,h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)

cv2.imshow('thresh', thresh)
cv2.imshow('dilate', dilate)
cv2.imshow('image', image)
cv2.waitKey()

answered Sep 19 '22 14:09

nathancy

Tags: python, image, image-processing, opencv, computer-vision
