I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin).
I am able to parse the text out of page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving them difficult. Additionally, saving each page's 3 pictures as a single image produces a file that cannot be opened (for example via Finder on OS X).
Sample:
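from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Defined before the page loop so the name exists when the loop calls it.
def find_images_in_thing(outer_layout):
    # An LTFigure is a container; look for images directly inside it.
    for thing in outer_layout:
        if isinstance(thing, LTImage):
            save_image(thing)

fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    pdf_item = device.get_result()  # layout tree for the current page
    for thing in pdf_item:
        if isinstance(thing, LTImage):
            save_image(thing)  # save_image is my own helper, described below
        if isinstance(thing, LTFigure):
            find_images_in_thing(thing)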
save_image either writes one file per image, named in pageNum_imgNum format, in 'wb' mode, or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.
Resources I've looked into:
http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdated pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html
It's been a while since this question has been asked, but I'll contribute for the sake of the community, and potentially for your benefit :)
I've been using an image parser called pdfimages, available through the poppler PDF processing framework. It also outputs several files per image; it seems like a relatively common behavior for PDF generators to 'tile' or 'strip' the images into multiple images that then need to be pieced together when scraping, but appear to be entirely intact while viewing the PDF. The formats/file extensions that I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
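For reference, here is a minimal sketch of driving pdfimages from Python. It assumes the poppler utilities are installed and on your PATH. The `-all` option asks pdfimages to write each image in its native format where it can, and `-list` prints a table of the images instead of extracting them. The helper names and the `img` output prefix are made up for illustration.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, prefix="img", list_only=False):
    """Build the argument list for a pdfimages call (hypothetical helper)."""
    if list_only:
        # -list: report the images in the PDF without writing any files.
        return ["pdfimages", "-list", pdf_path]
    # -all: keep native formats (JPEG, JPEG 2000, CCITT, ...) where possible.
    return ["pdfimages", "-all", pdf_path, prefix]


def run_pdfimages(pdf_path, prefix="img", list_only=False):
    """Run pdfimages and return whatever it printed to stdout."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH; install poppler first")
    cmd = build_pdfimages_command(pdf_path, prefix, list_only)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run_pdfimages('the_file.pdf', list_only=True)` first should show how many image objects each page holds and how they are encoded, which would make the three-strips-per-page layout visible before anything is extracted.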
Have you tried something like this?
from binascii import b2a_hex

def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    bytes_as_hex = b2a_hex(stream_first_4_bytes).decode()
    if bytes_as_hex.startswith('ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':
        file_type = '.png'
    elif bytes_as_hex == '47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):
        file_type = '.bmp'
    return file_type
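As a rough sketch of how a magic-number check could feed into saving, the snippet below guesses an extension from the leading bytes and writes each image to its own file in binary mode, using the pageNum_imgNum naming from the question. The names `guess_extension` and `save_image_bytes` are hypothetical. With pdfminer the raw bytes would presumably come from something like `LTImage.stream.get_rawdata()`, and this only helps when the stream is already a complete image file (a DCT-encoded JPEG, for example). Flate-compressed streams hold bare pixel data with no file header, so nothing matches and the sketch falls back to `.bin`.

```python
from pathlib import Path

# Leading bytes ("magic numbers") for a few common image file formats.
MAGIC_PREFIXES = (
    (b"\xff\xd8", ".jpeg"),
    (b"\x89PNG", ".png"),
    (b"GIF8", ".gif"),
    (b"BM", ".bmp"),
)


def guess_extension(raw, default=".bin"):
    """Return an extension based on the first bytes, or `default` if unknown."""
    for prefix, ext in MAGIC_PREFIXES:
        if raw.startswith(prefix):
            return ext
    return default


def save_image_bytes(raw, page_num, img_num, out_dir="."):
    """Write one image per file as <page_num>_<img_num><ext>; return the path."""
    target = Path(out_dir) / f"{page_num}_{img_num}{guess_extension(raw)}"
    target.write_bytes(raw)  # binary write, one file per image
    return target
```

Writing one file per image matters here: appending several encoded images to the same file in 'a' mode does not merge them into one picture, which would explain why the combined file cannot be opened.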