How to convert PDF into image readable by opencv-python?

Question

I am using the following code to draw a rectangle around text in an image that matches a date pattern, and it works fine.

import re
import cv2
import pytesseract
from PIL import Image
from pytesseract import Output

img = cv2.imread('invoice-sample.jpg')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())

# raw string so that \d reaches the regex engine unchanged
date_pattern = r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$'

n_boxes = len(d['text'])
for i in range(n_boxes):
    if int(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imshow('img', img)
cv2.waitKey(0)
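# img is a NumPy array (BGR), so convert it to a PIL image before saving as PDF
Image.fromarray(cv2.cvtColor(img, cv2.COLOR_BGR2RGB)).save("sample.pdf")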

At the end I get a PDF with a rectangle drawn around each matched date.

I want to give this program a scanned PDF as input instead of the image above. It should first convert the PDF into an image format that OpenCV can read, and then apply the same processing. Please help. Any workaround is fine, but I need a solution that converts the PDF to an image and uses it directly, without saving it to disk and reading it back, because I have a lot of PDFs to process.

dewDevil · Accepted Answer

There is a library named pdf2image. You can install it with pip install pdf2image. You can then use the following to convert the pages of the PDF to images in the required format:

from pdf2image import convert_from_path

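pages = convert_from_path("pdf_file_to_convert")
# number the output files so each page is kept instead of being overwritten
for i, page in enumerate(pages, start=1):
    page.save(f"page_image_{i}.jpg", "JPEG")  # Pillow's format name is "JPEG", not "jpg"

You can then read these images with OpenCV and apply its functions.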

You can use BytesIO to do your work without saving the file:

from io import BytesIO
from PIL import Image

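with BytesIO() as f:
    page.save(f, format="JPEG")  # Pillow's format name is "JPEG", not "jpg"
    f.seek(0)
    img_page = Image.open(f)
    img_page.load()  # read the pixel data before the buffer is closed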

Cam · Answer

From PDF to opencv ready array in two lines of code. I have also added the code to resize and view the opencv image. No saving to disk.

# imports
from pdf2image import convert_from_path
import cv2
import numpy as np

# convert PDF to image then to array ready for opencv
pages = convert_from_path('sample.pdf')
# PIL pages are RGB; OpenCV expects BGR channel order
img = cv2.cvtColor(np.array(pages[0]), cv2.COLOR_RGB2BGR)

# opencv code to view image
img = cv2.resize(img, None, fx=0.5, fy=0.5)
cv2.imshow("img", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Remember that if poppler is not on your Windows PATH, you can pass its location to convert_from_path:

poppler_path = r'C:\path_to_poppler'
pages = convert_from_path('sample.pdf', poppler_path=poppler_path)
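
To connect this back to the question, here is a rough sketch of how the conversion could be combined with the date-matching loop for every page of a PDF, without writing intermediate images to disk. The names find_date_boxes and annotate_pdf are hypothetical helpers, not part of any library, and the sketch assumes poppler and Tesseract are installed. The arrays are kept in RGB order throughout, which is what pytesseract and Pillow expect, so no channel swap is needed as long as the image is not displayed with cv2.imshow.

```python
import re

# Same dd/mm/yyyy pattern as in the question, written as a raw string.
DATE_PATTERN = re.compile(r'^(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d$')


def find_date_boxes(data, min_conf=60):
    """Return (x, y, w, h) for each confident OCR word that looks like a date.

    `data` is the dict returned by pytesseract.image_to_data with Output.DICT.
    """
    boxes = []
    for i, text in enumerate(data['text']):
        # Confidence may arrive as an int, a float, or a numeric string.
        if float(data['conf'][i]) > min_conf and DATE_PATTERN.match(text):
            boxes.append((data['left'][i], data['top'][i],
                          data['width'][i], data['height'][i]))
    return boxes


def annotate_pdf(pdf_path, out_path):
    """Draw boxes around dates on every page and save the result as a PDF."""
    # Third-party imports are local so find_date_boxes works without them.
    import cv2
    import numpy as np
    import pytesseract
    from PIL import Image
    from pdf2image import convert_from_path

    annotated = []
    for page in convert_from_path(pdf_path):
        rgb = np.array(page.convert('RGB'))  # writable copy in RGB order
        data = pytesseract.image_to_data(rgb, output_type=pytesseract.Output.DICT)
        for x, y, w, h in find_date_boxes(data):
            # (0, 255, 0) is green in both RGB and BGR order.
            cv2.rectangle(rgb, (x, y), (x + w, y + h), (0, 255, 0), 2)
        annotated.append(Image.fromarray(rgb))
    if annotated:
        # Pillow writes a multi-page PDF when save_all=True.
        annotated[0].save(out_path, save_all=True, append_images=annotated[1:])
```

A call such as annotate_pdf('sample.pdf', 'sample_annotated.pdf') would then process one file; looping over a folder of PDFs is a matter of calling it once per path.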

Tags:

python

python-imaging-library

tesseract

python-tesseract

P.Natu
