Python pdfminer extract image produces multiple images per page (should be single image)

I am attempting to extract images that are in a PDF. The file I am working with is 2+ pages. Page 1 is text and pages 2-n are images (one per page, or it may be a single image spanning multiple pages; I do not have control over the origin).

I am able to parse the text out of page 1, but when I try to get the images I get 3 images per image page. I cannot determine the image type, which makes saving them difficult. Additionally, trying to save each page's 3 pictures as a single image produces no usable result (the file cannot be opened via Finder on OS X).

Sample:

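from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTImage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def find_images_in_thing(outer_layout):
    # Save every image that sits directly inside a figure container.
    for thing in outer_layout:
        if isinstance(thing, LTImage):
            save_image(thing)  # save_image is described below


fp = open('the_file.pdf', 'rb')
parser = PDFParser(fp)
document = PDFDocument(parser)
rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    pdf_item = device.get_result()  # layout tree for the page just processed
    for thing in pdf_item:
        if isinstance(thing, LTImage):
            save_image(thing)
        if isinstance(thing, LTFigure):
            find_images_in_thing(thing)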

save_image either writes one file per image, named in pageNum_imgNum format and opened in 'wb' mode, or a single image per page in 'a' mode. I have tried numerous file extensions with no luck.

Resources I've looked into:

http://denis.papathanasiou.org/posts/2010.08.04.post.html (outdated pdfminer version)
http://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Erik asked Jul 11 '16 22:07


2 Answers

It's been a while since this question was asked, but I'll contribute for the sake of the community, and potentially for your benefit :)

I've been using an image extractor called pdfimages, which is part of the Poppler PDF utilities. It also outputs several files per image; it seems to be relatively common for PDF generators to 'tile' or 'strip' an image into multiple pieces that need to be stitched back together when scraping, even though the image appears entirely intact when viewing the PDF. The formats/file extensions I have seen through pdfimages and elsewhere are: png, tiff, jp2, jpg, ccitt. Have you tried all of those?
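If it helps, here is a rough sketch of driving pdfimages from Python. It assumes pdfimages is installed and on your PATH, and that your Poppler build supports the -all flag (which writes images in their native format where it can). The function names build_pdfimages_command and extract_images are made up for this example.

```python
import shutil
import subprocess


def build_pdfimages_command(pdf_path, out_prefix, native_formats=True):
    """Build the argument list for a pdfimages call (hypothetical helper)."""
    cmd = ['pdfimages']
    if native_formats:
        # -all keeps JPEG/JPEG2000/JBIG2/CCITT data in its native format;
        # assumes a Poppler version that supports this flag.
        cmd.append('-all')
    cmd += [pdf_path, out_prefix]
    return cmd


def extract_images(pdf_path, out_prefix):
    """Run pdfimages; output files are named after out_prefix."""
    if shutil.which('pdfimages') is None:
        raise RuntimeError('pdfimages not found on PATH')
    subprocess.run(build_pdfimages_command(pdf_path, out_prefix), check=True)
```

Calling extract_images('the_file.pdf', 'img') should leave files starting with img- in the current directory. If a page was stored as strips you will still get one file per strip, so the pieces would need to be stitched afterwards.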

Nikhil Shinday answered Oct 21 '22 12:10


Have you tried something like this?

from binascii import b2a_hex


def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    bytes_as_hex = b2a_hex(stream_first_4_bytes).decode()
    if bytes_as_hex.startswith('ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':
        file_type = '.png'
    elif bytes_as_hex == '47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):
        file_type = '.bmp'
    return file_type
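A possible way to wire a check like this into saving, as a sketch only: it assumes the object you pass in is a pdfminer LTImage whose .stream offers get_rawdata() and get_data(). The names extension_for and save_image_with_guessed_type, the '.bin' fallback, and the page/image numbering arguments are all made up for illustration. Note that only some streams (JPEG, for example) are complete image files; Flate-compressed streams hold raw pixel data, so no magic number will match and the fallback extension is used.

```python
import os
from binascii import b2a_hex

# Hex magic-number prefixes mapped to extensions (same values as the function above).
MAGIC_PREFIXES = {
    'ffd8': '.jpeg',
    '89504e47': '.png',
    '47494638': '.gif',
    '424d': '.bmp',
}


def extension_for(raw_bytes, default='.bin'):
    """Guess an extension from the first 4 bytes; fall back to `default`."""
    head = b2a_hex(raw_bytes[:4]).decode()
    for prefix, ext in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return ext
    return default


def save_image_with_guessed_type(lt_image, page_num, img_num, out_dir='.'):
    """Hypothetical save_image: write the stream bytes as pageNum_imgNum.<ext>."""
    raw = lt_image.stream.get_rawdata()
    if raw is None:
        # Raw bytes may be unavailable once the stream has been decoded.
        raw = lt_image.stream.get_data()
    name = '%d_%d%s' % (page_num, img_num, extension_for(raw))
    path = os.path.join(out_dir, name)
    with open(path, 'wb') as out:  # binary mode, one file per image
        out.write(raw)
    return path
```

If every image on a page comes back as '.bin', that would suggest the data is not a ready-made image file and would need to be rebuilt from its width, height and color space instead.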
Dilshat answered Oct 21 '22 11:10