Detect and crop a box in .pdf or image as individual images

I have a multi-page .pdf (scanned images) containing handwriting that I would like to crop and store as separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I do this automatically for a large, multi-page .pdf using Python?

enter image description here

I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x,y) coordinates. However, this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the pdf. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x,y) coordinate approach:

from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("data/samples.pdf")

writer = PdfFileWriter()

# Loop through all pages in pdf object to crop based on (x,y) coordinates
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)

Thank you in advance for your help

Steve asked Oct 15 '22 12:10



1 Answer

Here's a simple approach using OpenCV:

  • Convert image to grayscale and apply a Gaussian blur
  • Threshold image
  • Find contours
  • Iterate through contours and filter using contour area
  • Extract ROI

After extracting the ROI, you can save each as a separate image and then perform OCR text extraction using pytesseract or some other tool.


Results

enter image description here

enter image description here

You mention this:

The boundaries/coordinates of the handwriting boxes won't always be the same for each page in the pdf.

Your approach of using fixed (x,y) coordinates isn't very robust, since the boxes could be anywhere on the page. A better approach is to detect the boxes by filtering contours against a minimum area threshold. Depending on how small or large a box you want to detect, you can adjust the min_area variable. If you want additional filtering to prevent false positives, you can add aspect ratio as a second filter: calculate the width-to-height ratio of each contour's bounding rectangle and treat the contour as a valid box only if the ratio is within bounds (say 0.8 to 1.2 for a roughly square ROI; wide rectangular boxes would need a higher range).
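The combined area and aspect-ratio check could be sketched as a small helper like the one below. The name is_box and its default bounds are illustrative assumptions, not part of any library; the bounds should be tuned to the boxes on your own pages.

```python
def is_box(w, h, area, min_area=10000, min_ratio=0.8, max_ratio=1.2):
    """Return True if a contour passes both the area and aspect-ratio filters.

    w, h  : width and height of the contour's bounding rectangle
    area  : contour area (e.g. from cv2.contourArea)
    The default bounds are illustrative and only suit roughly square boxes.
    """
    # Reject degenerate rectangles and anything at or below the area threshold
    if h == 0 or area <= min_area:
        return False
    aspect_ratio = w / float(h)
    return min_ratio <= aspect_ratio <= max_ratio
```

Inside the contour loop this would replace the plain area test, for example: x,y,w,h = cv2.boundingRect(c) followed by if is_box(w, h, cv2.contourArea(c)). For wide boxes, pass a range such as min_ratio=3, max_ratio=5.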

import cv2

image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 230,255,cv2.THRESH_BINARY_INV)[1]

# Find contours
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Iterate through contours and filter for ROI
image_number = 0
min_area = 10000
for c in cnts:
    area = cv2.contourArea(c)
    if area > min_area:
        x,y,w,h = cv2.boundingRect(c)
        cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 2)
        ROI = original[y:y+h, x:x+w]
        cv2.imwrite("ROI_{}.png".format(image_number), ROI)
        image_number += 1

cv2.imshow('image', image)
cv2.waitKey(0)
nathancy answered Nov 15 '22 09:11
