
extract images from PDF with PHP

Tags: php, image, pdf

The thing is that the client wants to upload a pdf with images as a way of batch processing multiple images at once.

I have already looked around, and out of the box PHP can't read PDFs.

What are my alternatives?

I already know that the host has not installed ImageMagick or any PDF library, and the exec function is disabled. That basically leaves me with nothing to work with, I guess?

Does anyone know of an online service that can do this, with an API of some sort?

Thanks in advance.

Richard asked Dec 05 '13 14:12


2 Answers

AFAIK, there is no PHP module to do it. There is a command line tool, pdfimages (part of xpdf). For reference, here's how that works:

pdfimages -j source.pdf image

This extracts all images from source.pdf into files named image-000.jpg, image-001.jpg, etc. Note that -j only writes images stored in DCT (JPEG) format as .jpg files; any other images are written as PBM or PPM files.

Possible Options

Being a command line tool, you need exec (or system, passthru, any of the command executing functions built into PHP). As your environment doesn't have that, I see four options:

  1. Beg that exec be turned on for you (your hosting provider can limit what you can exec to a single command)
  2. Change the design -- how about a ZIP upload?
  3. Roll your own, using the source code of pdfimages as a model
  4. Let pdfimages do the heavy lifting, by running it on a remote host you do control
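
For option 1, once exec is allowed, the call from PHP could look roughly like this. It is only a minimal sketch: the function names execAvailable, pdfimagesCommand and extractWithPdfimages are made up for illustration, and it assumes pdfimages is installed and on the server's PATH.

```php
<?php
// Hypothetical helper: true if exec() exists and is not listed in disable_functions.
function execAvailable(): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists('exec') && !in_array('exec', $disabled, true);
}

// Hypothetical helper: build the command line, escaping both arguments for the shell.
function pdfimagesCommand(string $pdfPath, string $outputPrefix): string
{
    return 'pdfimages -j ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($outputPrefix);
}

// Run pdfimages and return the files it wrote (assumes pdfimages is on the PATH).
function extractWithPdfimages(string $pdfPath, string $outputPrefix): array
{
    if (!execAvailable()) {
        throw new RuntimeException('exec() is disabled on this host');
    }
    exec(pdfimagesCommand($pdfPath, $outputPrefix), $output, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("pdfimages exited with code $exitCode");
    }
    // pdfimages names its output <prefix>-000.<ext>, <prefix>-001.<ext>, ...
    $files = glob($outputPrefix . '-*') ?: [];
    sort($files);
    return $files;
}
```

Escaping the arguments matters here because the PDF path comes from an upload; never pass it to the shell unescaped.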

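Regarding #3: I don't think rolling your own would be too difficult if you keep the requirements very narrow. I seem to recall that the image boundaries in a PDF are well defined: each image is a stream object whose data sits between the stream and endstream keywords. For JPEG images (filter DCTDecode) those bytes are already a complete JPEG file, so you read the file up to a boundary, cut out the data up to the end of the boundary, write it to a file, and repeat. No base64_decode is needed, but images stored with other filters (such as FlateDecode) would have to be decoded first. However, that may be too much...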
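
A very narrow version of this can be sketched in pure PHP for one case only: images stored with the DCTDecode filter, whose stream bytes are already a complete JPEG file. The function name extractJpegStreams is made up, and the sketch assumes an unencrypted PDF, images that are not packed into object streams, and DCTDecode as the only filter on the image. Anything else is silently skipped, so treat it as a starting point rather than a replacement for pdfimages.

```php
<?php
// Sketch: return the raw bytes of every JPEG-encoded image stream in a PDF.
function extractJpegStreams(string $pdf): array
{
    $images = [];
    $pos = 0;
    // Walk the file one "N G obj ... endobj" block at a time.
    while (($objPos = strpos($pdf, ' obj', $pos)) !== false) {
        $endObj = strpos($pdf, 'endobj', $objPos);
        if ($endObj === false) {
            break;
        }
        $body = substr($pdf, $objPos + 4, $endObj - $objPos - 4);
        $pos = $endObj + 6;

        $streamPos = strpos($body, 'stream');
        $endStream = strrpos($body, 'endstream');
        if ($streamPos === false || $endStream === false || $endStream <= $streamPos) {
            continue; // not a stream object
        }

        // The dictionary before "stream" says what the stream contains.
        $dict = substr($body, 0, $streamPos);
        $isImage = preg_match('~/Subtype\s*/Image\b~', $dict) === 1;
        $isJpeg = preg_match('~/Filter\s*(?:/DCTDecode\b|\[\s*/DCTDecode\s*\])~', $dict) === 1;
        if (!$isImage || !$isJpeg) {
            continue;
        }

        // Skip the line break that follows the "stream" keyword.
        $dataStart = $streamPos + 6;
        if (substr($body, $dataStart, 2) === "\r\n") {
            $dataStart += 2;
        } elseif (substr($body, $dataStart, 1) === "\n") {
            $dataStart += 1;
        }
        // Drop the line break before "endstream"; a JPEG ends in FF D9, so this is safe.
        $images[] = rtrim(substr($body, $dataStart, $endStream - $dataStart), "\r\n");
    }
    return $images;
}
```

Each returned string can be written straight to disk, for example with file_put_contents(sprintf('image-%03d.jpg', $i), $jpeg) inside a foreach over the result.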

If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").

  • Find a cheap hosting environment (e.g. Amazon EC2) that lets you use exec and curl
  • Install pdfimages
  • Write a PHP script that takes a URL to a PDF, downloads that PDF with curl, writes it to disk, passes it to pdfimages, then returns the URLs of the resulting images.

An example exchange could look like this:

GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf

Content-type: text/html


<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>

So your single pdfimages.php script (running on the host with the exec functionality) can both extract images and give you access to the extracted images. When extracting, it reads the PDF you point it at, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it just gives you back the image itself.
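
The URL handling for those two modes might be sketched as follows. Both function names, the storage directory layout, and the rule that a job ID is a lowercase alphanumeric string generated by the script itself are assumptions for illustration, not part of any existing tool. The sketch only serves .jpg files.

```php
<?php
// Hypothetical helper: turn the extracted file names of one job into "retrieve" URLs.
function retrieveUrls(string $scriptUrl, string $jobId, array $fileNames): array
{
    $urls = [];
    foreach ($fileNames as $name) {
        $urls[] = $scriptUrl . '?retrieve=' . $jobId . '/' . basename($name);
    }
    return $urls;
}

// Hypothetical helper: map a "retrieve" value to a file under $storageDir,
// or return null if it does not look like "<jobId>/image-NNN.jpg".
function resolveRetrievePath(string $storageDir, string $retrieve): ?string
{
    if (!preg_match('~^[a-z0-9]+/image-\d{3,}\.jpg$~', $retrieve)) {
        return null; // rejects "../", absolute paths, and other extensions
    }
    return rtrim($storageDir, '/') . '/' . $retrieve;
}
```

Validating the retrieve parameter against a strict pattern keeps the script from being used to read arbitrary files on the extraction host.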

You would need to deal with cleanup; perhaps the simplest approach is to delete each image after it has been retrieved. You would also need to handle security. I don't know what's in these images, but the transfer might need to go over SSL, with other precautions taken as well.
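
A rough sketch of that cleanup, assuming extracted files sit in a directory the script owns: serveAndDelete sends one image and removes it, and purgeStaleFiles removes anything that was never fetched. Both names and the age threshold are hypothetical.

```php
<?php
// Hypothetical helper: send one extracted JPEG to the client, then delete it.
function serveAndDelete(string $path): bool
{
    if (!is_file($path)) {
        http_response_code(404);
        return false;
    }
    header('Content-Type: image/jpeg');
    header('Content-Length: ' . filesize($path));
    readfile($path);
    return unlink($path);
}

// Hypothetical helper: delete files in $dir older than $maxAgeSeconds.
// Returns the number of files removed.
function purgeStaleFiles(string $dir, int $maxAgeSeconds): int
{
    $removed = 0;
    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $file) {
        if (is_file($file) && time() - filemtime($file) > $maxAgeSeconds && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}
```

Deleting on retrieval alone leaves orphans behind when a client never fetches an image, which is why the second function is there.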

bishop answered Nov 07 '22 05:11


You can use pdfimages and install it this way:

apt install poppler-utils

Then use it this way to get all the images as PNG files:

pdfimages -png mypdf.pdf image

The images are written to the current directory as image-000.png, image-001.png, etc.

There are many options available, including some to change the output format; see the pdfimages man page for more information.

I hope this helps!

Bruno Leveque answered Nov 07 '22 05:11