Detect and crop a box in .pdf or image as individual images

Question

I have a multi-page .pdf (scanned images) containing handwriting that I would like to crop and store as separate images. For example, in the image below I would like to extract the handwriting inside the two boxes as separate images. How can I do this automatically for a large, multi-page .pdf using Python?

enter image description here

I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x,y) coordinates. However, this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the pdf. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x,y) coordinate approach:

from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("data/samples.pdf")

# getting the first page
page = reader.getPage(0)

writer = PdfFileWriter()

# Loop through all pages in pdf object to crop based on (x,y) coordinates
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)

Thank you in advance for your help

nathancy · Accepted Answer

Here's a simple approach using OpenCV:

1. Convert the image to grayscale and apply a Gaussian blur
2. Threshold the image
3. Find contours
4. Iterate through the contours and filter by contour area
5. Extract each ROI

After extracting the ROI, you can save each as a separate image and then perform OCR text extraction using pytesseract or some other tool.

Results

enter image description here

You mention this:

"The boundaries/coordinates of the handwriting boxes won't always be the same for each page in the pdf."

As you noted, hard-coded (x,y) coordinates aren't robust, since the boxes can be anywhere on the page. A better approach is to detect the boxes by filtering contours with a minimum area threshold. Adjust that threshold (min_area in the code) to match how small or large the boxes you want to detect are. To reduce false positives, you can add aspect ratio as a second filter: compute width / height of each contour's bounding rectangle and accept the contour only if the ratio falls within bounds (say 0.8 to 1.2 for a roughly square ROI; a wide rectangular box needs a correspondingly larger range).
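As a rough sketch of the aspect-ratio filter described above: the helper name is_valid_box and its ratio_range parameter are hypothetical (they are not part of OpenCV), and the numbers in the demo are made-up stand-ins for the values you would get from cv2.contourArea and cv2.boundingRect.

```python
def is_valid_box(area, w, h, min_area=10000, ratio_range=(0.8, 1.2)):
    """Hypothetical helper: keep a contour only if it is large enough and
    its bounding rectangle has a plausible width/height ratio."""
    if h == 0 or area <= min_area:
        return False
    aspect_ratio = w / float(h)
    return ratio_range[0] <= aspect_ratio <= ratio_range[1]


# (area, width, height) triples standing in for real contour measurements
candidates = [
    (12000, 110, 100),  # large and near-square
    (12000, 400, 30),   # large but long and thin, e.g. a ruled line
    (500, 20, 22),      # near-square but tiny, e.g. a speck
]
kept = [c for c in candidates if is_valid_box(*c)]

# A wide box such as the 458 x 130 crop region from the question
# only passes if the accepted ratio range is widened accordingly
wide_ok = is_valid_box(458 * 130, 458, 130, ratio_range=(2.0, 5.0))
```

Inside the contour loop this would be called as is_valid_box(cv2.contourArea(c), w, h) in place of the plain area check. The original script, which filters on area alone, follows.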

import cv2

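# Load the image and keep an unmodified copy to crop from
image = cv2.imread('1.jpg')
original = image.copy()

# Grayscale, light blur, then inverse threshold so dark box outlines become white
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 230, 255, cv2.THRESH_BINARY_INV)[1]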

# Find external contours (findContours returns 2 values in OpenCV 2.x/4.x, 3 in 3.x)
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Iterate through contours and filter for ROI
image_number = 0
min_area = 10000
for c in cnts:
    area = cv2.contourArea(c)
    if area > min_area:
        x, y, w, h = cv2.boundingRect(c)
        # Draw the detection on the preview image, but crop from the untouched copy
        cv2.rectangle(image, (x, y), (x + w, y + h), (36, 255, 12), 2)
        ROI = original[y:y+h, x:x+w]
        cv2.imwrite("ROI_{}.png".format(image_number), ROI)
        image_number += 1

# Show the annotated image until a key is pressed
cv2.imshow('image', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Tags: python, image-processing, opencv, computer-vision, pypdf2

Steve

1 Answers

nathancy
