Given a PDF, how to extract the images *and their locations on the page* from the command line?

Question

I have a PDF which includes text and images. I want to extract the images from the PDF using the Linux command line. I can use pdfimages to extract the images, but I also want to find where on each page each image is located. pdfimages can tell me which page each image is on (from the filename), but that's all it gives me. Is there any other FLOSS tool that can do this?

Eric Fortis · Accepted Answer

The PDF must contain the information for placing the images, so reading it directly should be possible. Failing that, one workaround is:

1. Convert each PDF page to an image with pdftoppm
2. Extract the images from each page with pdfimages
3. Convert the images to a single 8-bit greyscale channel (for faster analysis) with cvCvtColor
4. Locate each extracted image on its page with matchTemplate

Step 2 may look like this (Step 1 is similar, using pdftoppm instead; adjust the range to the number of pages, since PDF page numbers start at 1):

for i in {1..100}; do pdfimages -f "$i" -l "$i" file.pdf "page$i"; done

For Step 3, here* is a simple example.

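Step 4 needs no training, because the extracted image should match a region of the rendered page almost exactly, provided the page is rendered at a resolution where the image appears at its native pixel size. The page image goes first and the extracted image is the template:

matchTemplate(pdfPageImg, extractedImg, outputMap, CV_TM_SQDIFF)

With CV_TM_SQDIFF, the best match is the minimum of outputMap.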

(* - link removed as it now appears to be pointing towards a ransomware site)
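To make Step 4 concrete, here is a minimal pure-Python sketch of what a squared-difference template match computes. It is not OpenCV's implementation and is far too slow for real pages; in practice you would call OpenCV's matchTemplate. The function name match_template_sqdiff and the tiny hand-made "page" and "image" are made up for illustration, and greyscale images are assumed to be plain lists of rows of integers.

```python
def match_template_sqdiff(page, templ):
    """Return (top, left, score) of the best squared-difference match.

    page and templ are greyscale images given as lists of rows of ints.
    A score of 0 means the template matches that region exactly.
    """
    ph, pw = len(page), len(page[0])
    th, tw = len(templ), len(templ[0])
    if th > ph or tw > pw:
        raise ValueError("template is larger than the page")
    best = None
    # Slide the template over every position where it fits entirely.
    for top in range(ph - th + 1):
        for left in range(pw - tw + 1):
            score = 0
            for r in range(th):
                page_row = page[top + r]
                templ_row = templ[r]
                for c in range(tw):
                    d = page_row[left + c] - templ_row[c]
                    score += d * d
            # Keep the position with the smallest sum of squared differences.
            if best is None or score < best[2]:
                best = (top, left, score)
    return best


# Hypothetical data: a 2x2 "image" placed at row 1, column 2 of a 4x5 "page".
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 7, 0],
    [0, 0, 5, 3, 0],
    [0, 0, 0, 0, 0],
]
templ = [[9, 7], [5, 3]]
print(match_template_sqdiff(page, templ))  # prints (1, 2, 0)
```

The returned top/left pair is the image's position in page pixels. To get a location in PDF units you would still need to divide by the resolution the page was rendered at.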

someuser9809 · Answer

There's an -xml switch for the pdftohtml command which will give image position, dimension and source information.

pdftohtml -xml file.pdf
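Once you have the XML, pulling out the image boxes takes only a few lines. The sketch below assumes the output contains page elements with number, width and height attributes, and image elements with top, left, width, height and src attributes. That is what Poppler's pdftohtml output has looked like when I have seen it, but check it against your own file. The sample XML is hand-written, not real tool output, and image_boxes is just an illustrative name.

```python
import xml.etree.ElementTree as ET


def image_boxes(xml_text):
    """Return one dict per <image> element found in pdftohtml -xml output."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_w = float(page.get("width"))
        page_h = float(page.get("height"))
        for img in page.iter("image"):
            left, top = float(img.get("left")), float(img.get("top"))
            width, height = float(img.get("width")), float(img.get("height"))
            boxes.append({
                "page": int(page.get("number")),
                "src": img.get("src"),
                "left": left, "top": top, "width": width, "height": height,
                # Fractions of the page size, so the result does not depend
                # on whatever units or zoom the tool used for its output.
                "rel_left": left / page_w,
                "rel_top": top / page_h,
            })
    return boxes


# Hand-written sample in the assumed format (not real tool output).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1000" width="800">
  <image top="100" left="200" width="400" height="300" src="file-1_1.png"/>
</page>
</pdf2xml>"""

for box in image_boxes(SAMPLE):
    print(box["page"], box["src"], box["left"], box["top"], box["width"], box["height"])
```

Storing positions as fractions of the page width and height sidesteps the question of which units the coordinates are in.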

mark stephens · Answer

PDF does not guarantee that an image reused in several places is stored as a single shared object; each use may be a separate image. There is very little image metadata in a PDF file beyond the page location and the image's actual size on the page. I wrote an article explaining how images are stored inside a PDF at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/
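If you want to check whether a particular file stores a reused image once or several times, one option is to look at the object numbers printed by pdfimages -list. The sketch below groups rows of that listing by object number. It assumes the object number and generation are the 11th and 12th whitespace-separated columns, as in the Poppler versions I have seen. The sample listing is hand-written, and group_by_object is an illustrative name.

```python
from collections import defaultdict


def group_by_object(listing):
    """Map (object number, generation) -> list of (page, image index) uses."""
    uses = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the ruler line, and anything too short to be a row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        # Skip rows without a numeric object number (assumed: inline images).
        if not (fields[10].isdigit() and fields[11].isdigit()):
            continue
        page, num = int(fields[0]), int(fields[1])
        uses[(int(fields[10]), int(fields[11]))].append((page, num))
    return dict(uses)


# Hand-written sample in the assumed column layout (not real tool output).
SAMPLE_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     640   480  rgb     3   8  jpeg   no        12  0   150   150 45.1K 5.0%
   2     1 image     640   480  rgb     3   8  jpeg   no        12  0   150   150 45.1K 5.0%
   2     2 image     100   100  gray    1   8  image  no        15  0    72    72 1024B  10%
"""

for obj, where in group_by_object(SAMPLE_LISTING).items():
    print(obj, where)
```

An object that shows up on more than one page is a shared image. Two visually identical images with different object numbers are stored separately.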

Tags: linux, command-line, pdf
Amandasaurus
