I have a multi-page .pdf (scanned images) containing handwriting I would like to crop and store as new separate images. For example, in the visual below I would like to extract the handwriting inside the 2 boxes as separate images. How can I automatically do this for a large, multi-page .pdf using Python?
I tried using the PyPDF2 package to crop one of the handwriting boxes based on (x,y) coordinates. However, this approach doesn't work for me because the boundaries/coordinates of the handwriting boxes won't always be the same on each page of the PDF. I believe detecting the boxes would be a better approach for auto-cropping. Not sure if it's useful, but below is the code I used for the (x,y) coordinate approach:
from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader("data/samples.pdf")
# getting the first page
page = reader.getPage(0)
writer = PdfFileWriter()

# Loop through all pages in the PDF and crop each one to the same (x,y) coordinates
for i in range(reader.getNumPages()):
    page = reader.getPage(i)
    page.cropBox.setLowerLeft((42, 115))
    page.cropBox.setUpperRight((500, 245))
    writer.addPage(page)

with open("samples_cropped.pdf", "wb") as fp:
    writer.write(fp)
Thank you in advance for your help
Here's a simple approach using OpenCV.
After extracting each ROI, you can save it as a separate image and then perform OCR text extraction using pytesseract or some other tool.
Results
You mention this:
The boundaries/coordinates of the handwriting boxes won't always be the same for each page in the pdf.
Your current approach of using fixed (x,y) coordinates isn't very robust, since the boxes could be anywhere on the image. A better approach is to find contours and keep only those above a minimum contour area. Depending on how small or large a box you want to detect, you can adjust the min_area variable. If you want additional filtering to prevent false positives, you can add aspect ratio as another filter: calculate the aspect ratio (width divided by height) of each contour's bounding rectangle and accept the contour only if the ratio is within bounds (say 0.8 to 1.2 for a roughly square ROI; wider boxes need a different range).
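The area and aspect-ratio checks described above could be sketched as a small helper. This is only an illustration: the function name is_candidate_box and its parameters are hypothetical, the default bounds are the example values from the text, and it tests the bounding-box area (w * h) rather than the contour area.

```python
def is_candidate_box(w, h, min_area=10000, min_ratio=0.8, max_ratio=1.2):
    """Return True if a w-by-h bounding box passes both filters.

    Hypothetical helper: names and default bounds are assumptions,
    and should be tuned to the boxes on your scanned pages.
    """
    if w <= 0 or h <= 0:
        return False
    # Reject small regions such as specks or individual pen strokes.
    # Note: this uses the bounding-box area, not cv2.contourArea.
    if w * h < min_area:
        return False
    # Aspect ratio is width divided by height.
    ratio = w / float(h)
    return min_ratio <= ratio <= max_ratio
```

It would be called with the w and h returned by cv2.boundingRect for each contour. For wide handwriting boxes, a larger range such as min_ratio=2.0, max_ratio=6.0 may be more suitable than the near-square defaults.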
import cv2

# Load the image, convert to grayscale, blur, and threshold to get a binary image
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (3, 3), 0)
thresh = cv2.threshold(blurred, 230, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours (the return value differs between OpenCV versions)
cnts = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Iterate through contours and filter for ROI
image_number = 0
min_area = 10000
for c in cnts:
    area = cv2.contourArea(c)
    if area > min_area:
        x, y, w, h = cv2.boundingRect(c)
        cv2.rectangle(image, (x, y), (x + w, y + h), (36, 255, 12), 2)
        ROI = original[y:y+h, x:x+w]
        cv2.imwrite("ROI_{}.png".format(image_number), ROI)
        image_number += 1

cv2.imshow('image', image)
cv2.waitKey(0)
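For a multi-page PDF, each page would first need to be converted to an image (for example with a tool such as pdf2image) and then passed through the code above. Two details are worth handling in that case: the ROI_{}.png names would collide between pages, and contours are not necessarily returned in top-to-bottom order. A minimal sketch of one way to deal with both, assuming you have already collected the (x, y, w, h) boxes for a page; the function name name_rois and the file name pattern are hypothetical:

```python
def name_rois(page_number, boxes):
    """Order boxes in reading order and pair each with a per-page file name.

    boxes is a list of (x, y, w, h) tuples, e.g. from cv2.boundingRect.
    Hypothetical helper: the naming scheme is an assumption.
    """
    # Sort top-to-bottom first (y), then left-to-right (x).
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [
        ("page_{:03d}_roi_{}.png".format(page_number, i), box)
        for i, box in enumerate(ordered)
    ]
```

Each returned name can then be used in place of "ROI_{}.png" when calling cv2.imwrite, so that crops from different pages don't overwrite each other.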