I am using the following code to draw a rectangle around text in an image that matches a date pattern, and it is working fine.
import re
import cv2
import pytesseract
from PIL import Image
from pytesseract import Output

img = cv2.imread('invoice-sample.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())

# Raw string so the regex escape \d is not treated as a string escape
date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'

n_boxes = len(d['text'])
for i in range(n_boxes):
    # conf can be a number or a numeric string depending on the pytesseract version
    if float(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)

# img is a NumPy array in BGR order; convert it to a PIL image before saving as PDF
Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).save("sample.pdf")
At the end, I get a PDF with a rectangle drawn around each matched date.
I want to give this program a scanned PDF as input instead of the image above. It should first convert the PDF into an image format that OpenCV can read, then apply the same processing as above. Please help. (Any workaround is fine. I need a solution that converts the PDF to images and uses them directly, without saving them to disk and reading them back, because I have a lot of PDFs to process.)
There is a library named pdf2image. You can install it with pip install pdf2image. Then, you can use the following to convert pages of the pdf to images of the required format:
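from pdf2image import convert_from_path

pages = convert_from_path("pdf_file_to_convert")
for i, page in enumerate(pages):
    # Pillow's format name is "JPEG"; one file per page so earlier pages are not overwritten
    page.save(f"page_image_{i}.jpg", "JPEG")
Now you can apply OpenCV functions to these images.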
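To reuse the question's processing on every converted page, the date filter can be pulled out into a small helper. The following is only a sketch: find_date_boxes and the sample dictionary are hypothetical names and data, and the helper assumes its input has the same shape as the dictionary returned by pytesseract.image_to_data with output_type=Output.DICT.

```python
import re

# Same dd/mm/yyyy pattern as in the question, written as a raw string
DATE_PATTERN = re.compile(r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$')


def find_date_boxes(d, min_conf=60):
    """Return (x, y, w, h) for every confident word that looks like a date.

    `d` is assumed to hold parallel lists under the keys
    'text', 'conf', 'left', 'top', 'width' and 'height'.
    """
    boxes = []
    for i, text in enumerate(d['text']):
        # conf may be an int, a float or a numeric string, so go through float()
        if float(d['conf'][i]) > min_conf and DATE_PATTERN.match(text):
            boxes.append((d['left'][i], d['top'][i], d['width'][i], d['height'][i]))
    return boxes


# Hand-written sample data standing in for real OCR output
sample = {
    'text': ['Invoice', '12/03/2019', '31/12/2020', '45/13/2019'],
    'conf': ['95', '91.5', 40, 88],
    'left': [10, 120, 120, 300],
    'top': [5, 5, 40, 40],
    'width': [60, 80, 80, 80],
    'height': [12, 12, 12, 12],
}
date_boxes = find_date_boxes(sample)
print(date_boxes)
```

Each returned tuple can be passed to cv2.rectangle in the same way as in the question's loop.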
You can use BytesIO to do your work without saving the file:
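from io import BytesIO
from PIL import Image

with BytesIO() as f:
    page.save(f, format="JPEG")  # Pillow's format name is "JPEG", not "jpg"
    f.seek(0)
    img_page = Image.open(f)
    img_page.load()  # read the pixel data before the buffer is closed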
From PDF to an OpenCV-ready array in a few lines of code. I have also added the code to resize and view the OpenCV image. No saving to disk.
# imports
from pdf2image import convert_from_path
import cv2
import numpy as np

# convert PDF to image then to array ready for opencv
pages = convert_from_path('sample.pdf')
# pdf2image returns PIL images in RGB order; OpenCV expects BGR
img = cv2.cvtColor(np.array(pages[0]), cv2.COLOR_RGB2BGR)
# opencv code to view image
img = cv2.resize(img, None, fx=0.5, fy=0.5)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
Remember, if you do not have poppler in your Windows PATH variable, you can pass its location to convert_from_path with the poppler_path argument:
poppler_path = r'C:\path_to_poppler'
pages = convert_from_path('sample.pdf', poppler_path=poppler_path)
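Since the question mentions a lot of PDFs, the conversion can be wrapped in a generator that walks a folder and yields one page at a time. This is a minimal sketch, not a definitive implementation: iter_pdf_pages, its convert parameter and the folder name 'scans' are hypothetical. By default it calls pdf2image's convert_from_path and forwards extra keyword arguments such as poppler_path to it.

```python
from pathlib import Path


def iter_pdf_pages(folder, convert=None, **convert_kwargs):
    """Yield (pdf_name, page_number, page_image) for every PDF in `folder`.

    `convert` defaults to pdf2image.convert_from_path; it is a parameter only
    so the loop can be exercised without poppler installed.
    """
    if convert is None:
        # Imported lazily so the helper can be defined without pdf2image present
        from pdf2image import convert_from_path as convert
    for pdf_path in sorted(Path(folder).glob('*.pdf')):
        pages = convert(str(pdf_path), **convert_kwargs)
        for page_number, page in enumerate(pages, start=1):
            yield pdf_path.name, page_number, page


# Example use (assumes a folder named 'scans', with pdf2image and poppler installed):
# for name, number, page in iter_pdf_pages('scans', poppler_path=r'C:\path_to_poppler'):
#     img = np.array(page)  # then run the same OpenCV processing on img
```

Nothing is written to disk here. Each page is handed over as an in-memory PIL image that can go straight to np.array.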