How to extract images from a PDF in their original format

Tags: php, pdf, xpdf

I'm using pdfimages -j bar.pdf /tmp/image to extract images from a PDF. My objective is to get them in their raw state, as they were added: if an image was a .tif I'd like to get a .tif, and if it was a .jpg I'd like to get a .jpg. Instead, I keep getting .ppm for everything I extract.

Is it possible to get images in their original format, or is .ppm my only option?

Update: My primary objective is to check the DPI of all of the images included in the document, or to check whether they're vector.

asked Jan 25 '13 13:01

Webnet

2 Answers

First, what PDF parlance calls an 'image' is by definition always a raster image. There's no such thing as a 'vector image'. Even if the original file that was converted to PDF included vector graphics, the converter program may have decided to include them as raster images. If you extract one of these, you won't get your vector graphics back, only a raster image. Vector graphics that are preserved as such inside a PDF cannot be extracted by pdfimages.

Second, you do not need to actually extract the images using pdfimages. Provided you're using a current version (later than v0.20.2) of the 'Poppler' fork of pdfimages, you can use the -list parameter to list all images on a given range of PDF pages:

pdfimages -list -f 7 -l 8  ct-magazin-14-2012.pdf

  page   num  type   width height color comp bpc  enc interp  object ID
  ---------------------------------------------------------------------
     7     0 image     581   838  rgb     3   8  jpeg   no        39  0
     7     1 image       4     4  rgb     3   8  image  no        40  0
     7     2 image     314   332  rgb     3   8  jpx    no        44  0
     7     3 image     358   430  rgb     3   8  jpx    no        45  0
     7     4 image       4     4  rgb     3   8  image  no        46  0
     7     5 image       4     4  rgb     3   8  image  no        47  0
     7     6 image       4     6  rgb     3   8  image  no        48  0
     7     7 image     596   462  rgb     3   8  jpx    no        49  0
     7     8 image       4     6  rgb     3   8  image  no        50  0
     7     9 image       4     4  rgb     3   8  image  no        51  0
     7    10 image       8    10  rgb     3   8  image  no        41  0
     7    11 image       6     6  rgb     3   8  image  no        42  0
     7    12 image     113    27  rgb     3   8  jpx    no        43  0
     8    13 image     582   839  gray    1   8  jpeg   no      2080  0
     8    14 image     344   364  gray    1   8  jpx    no      2079  0

Note again: this version of pdfimages is the one from Poppler (the one from XPDF does not (yet?) support this new feature).

As you can see, this lists the width and height of each image in pixels. This however does not (yet) give you any clue about the DPI, because the DPI depends on how large the image is drawn on the page. If a large raster image is squeezed into a small space on the PDF page, its DPI value will be quite high. (This is what plinth's comment to his own answer also emphasizes...)

In order to calculate the DPI, you'll have to measure the width/height of the image as it is displayed on the page (you can do that with one of the tools in Acrobat/Reader) and then divide the pixel dimensions from the above output by that displayed size in inches.
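The arithmetic is small enough to sketch in a few lines of Python. This is only an illustration: the function name `effective_dpi` and the measured width of 139.44 pt are made up for the example, and it assumes the displayed size is given in PDF points (1/72 inch, the default page unit).

```python
POINTS_PER_INCH = 72.0  # default PDF page unit: 1 pt = 1/72 inch


def effective_dpi(pixels: int, displayed_points: float) -> float:
    """Resolution of one image axis as it is placed on the page."""
    if displayed_points <= 0:
        raise ValueError("displayed size must be positive")
    inches = displayed_points / POINTS_PER_INCH
    return pixels / inches


# Hypothetical measurement: an image that is 581 px wide (like the first
# entry in the listing) and is drawn 139.44 pt (about 1.94 in) wide.
print(round(effective_dpi(581, 139.44)))  # prints 300
```

If your measuring tool reports inches or millimetres instead of points, convert to inches first and divide the pixel count by that value directly.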

Update

Recent versions of pdfimages now directly show the actual resolution in DPI of the included images in additional columns. Obtaining this info was the original goal of the question:

  pdfimages -list -f 6 -l 7 example.pdf
  page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------------
     6     0 image    1901  1901  rgb     3   8  image  no       632  0  1818  1818  468K 4.4%
     6     1 image    1901  1901  rgb     3   8  image  no       645  0  1818  1818  521K 4.9%

The new output format additionally shows the respective horizontal and vertical resolutions for each image ('x-ppi', 'y-ppi'). It also gives the actual size of images in terms of storage ('size') and their compression ratios ('ratio').
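To check every image in a document against a minimum resolution, the listing can be parsed by a script. The following Python sketch is one possible way to do it, not part of pdfimages itself. It assumes the 16-column layout shown in the listing above (other versions may print different columns), and the names `ImageInfo`, `parse_image_list` and `below_resolution` are invented for this example.

```python
from typing import List, NamedTuple

# Sample text copied from the listing above; in practice this would be
# the captured standard output of `pdfimages -list file.pdf`.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   6     0 image    1901  1901  rgb     3   8  image  no       632  0  1818  1818  468K 4.4%
   6     1 image    1901  1901  rgb     3   8  image  no       645  0  1818  1818  521K 4.9%
"""


class ImageInfo(NamedTuple):
    page: int
    num: int
    width: int
    height: int
    enc: str
    x_ppi: int
    y_ppi: int


def parse_image_list(text: str) -> List[ImageInfo]:
    """Parse the assumed 16-column `pdfimages -list` layout."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and the dashed separator: data rows have
        # 16 fields and start with a numeric page number.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        rows.append(ImageInfo(
            page=int(fields[0]),
            num=int(fields[1]),
            width=int(fields[3]),
            height=int(fields[4]),
            enc=fields[8],
            x_ppi=int(fields[12]),
            y_ppi=int(fields[13]),
        ))
    return rows


def below_resolution(rows: List[ImageInfo], min_ppi: int) -> List[ImageInfo]:
    """Return the images whose lower axis resolution is under min_ppi."""
    return [r for r in rows if min(r.x_ppi, r.y_ppi) < min_ppi]


images = parse_image_list(SAMPLE)
print(len(images), "images,", len(below_resolution(images, 300)), "below 300 ppi")
```

To run it against a real file, replace `SAMPLE` with the output captured from the command, for example via `subprocess.run(["pdfimages", "-list", "example.pdf"], capture_output=True, text=True, check=True).stdout`.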

(Thanks to @Eric for suggesting an update hinting at these new features of pdfimages.)

answered Oct 26 '22 22:10

Kurt Pfeifle

You can't (reliably) know the source image file format by looking at an image in a PDF. For example, TIFF images can be compressed with (off the top of my head) none, RLE, CCITT (a couple of variations), LZW, Flate, or JPEG. If an image in a PDF is compressed with DCT (JPEG), how do you decide whether the source was TIFF or JPEG? If it is compressed with Flate, how do you distinguish between TIFF and PNG? Further, it is the software generating the PDF that decides the compression. So I can take a Flate-compressed TIFF image and encode it into a PDF using JPEG2000, or take a CCITT-compressed image and compress it with JBIG2, or take a JPEG image, reduce it to an 8-bit paletted image, and compress it with Flate.

TL;DR you can't know.
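What you can know is how the image is stored inside the PDF, which is what the enc column of pdfimages -list in the other answer reports. As a rough illustration, the Python sketch below maps those values to the corresponding PDF stream filters. The mapping table and the function name `describe_encoding` are written for this example; the value "jp2" is included on the assumption that some versions print it instead of "jpx".

```python
# PDF stream filter implied by each value of the 'enc' column.
ENC_TO_PDF_FILTER = {
    "jpeg": "DCTDecode",
    "jpx": "JPXDecode",
    "jp2": "JPXDecode",  # assumed alternative spelling of jpx
    "jbig2": "JBIG2Decode",
    "ccitt": "CCITTFaxDecode",
}


def describe_encoding(enc: str) -> str:
    """Say what an 'enc' value reveals, and what it does not."""
    pdf_filter = ENC_TO_PDF_FILTER.get(enc)
    if pdf_filter is None:
        # 'image' and anything unrecognised: plain samples, possibly
        # wrapped in a general-purpose filter such as Flate or LZW.
        return "no image-specific filter; source format unknown"
    return f"stored with {pdf_filter}; source format unknown"


for enc in ("jpeg", "jpx", "image"):
    print(enc, "->", describe_encoding(enc))
```

In every case the source format stays unknown: a DCTDecode stream may have started life as a TIFF, and a Flate-compressed image may have been a PNG, a TIFF, or a recompressed JPEG.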

answered Oct 26 '22 23:10

plinth
