
Given a PDF, how to extract the images *and their locations on the page* from the command line?

I have a PDF which includes text and images. I want to extract the images from the PDF using the Linux command line. I can use pdfimages to extract the images, but I also want to find the location on each page where each image sits. pdfimages tells me which page each image is on (from the filename), but that's all it gives me. Is there any other FLOSS tool that can do this?

Amandasaurus asked Jan 03 '11 00:01


3 Answers

Well, I think the PDF must contain the info for placing them, so this should be possible. Alternatively, one possible solution is:

  1. Convert each PDF page to an image with pdftoppm
  2. Extract the images from each page with pdfimages
  3. Convert the images to a single 8-bit greyscale channel (for faster analysis) with cvCvtColor
  4. Locate each extracted image on its page image with matchTemplate

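Step 1 may look similar to this loop for Step 2 (pdfimages page numbers start at 1, and -f and -l must be equal to extract a single page at a time):

for i in {1..100} ; do pdfimages -f "$i" -l "$i" file.pdf "page$i"; done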

For Step 3, here* is a simple example.

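In Step 4 you should not have problems with training, because the image will be an exact match, provided the page is rendered at a resolution where the image appears at its native pixel size. The page is the image being searched and the extracted image is the template: matchTemplate(pdfPageImg, imageToSearch, outputMap, CV_TM_SQDIFF)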
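To make Step 4 concrete, here is a minimal pure-Python sketch of what a squared-difference template match computes. It is not OpenCV's implementation; the function name `match_template_sqdiff` and the tiny greyscale arrays are made up for illustration. With OpenCV itself you would call `matchTemplate` and then take the location of the minimum of the result map (for example with `minMaxLoc`).

```python
def match_template_sqdiff(page, patch):
    """Slide `patch` over `page` and return (row, col, score) of the
    position with the smallest sum of squared differences.

    Both arguments are 2D lists of grey values (hypothetical stand-ins
    for the rendered page and one extracted image).
    Returns None if the patch is larger than the page.
    """
    page_h, page_w = len(page), len(page[0])
    patch_h, patch_w = len(patch), len(patch[0])
    best = None
    for r in range(page_h - patch_h + 1):
        for c in range(page_w - patch_w + 1):
            score = sum(
                (page[r + dr][c + dc] - patch[dr][dc]) ** 2
                for dr in range(patch_h)
                for dc in range(patch_w)
            )
            # A score of 0 means a pixel-exact match at (r, c).
            if best is None or score < best[2]:
                best = (r, c, score)
    return best


# Toy data: a 4x5 "page" with a 2x2 "image" embedded at row 1, column 2.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 6, 0],
    [0, 0, 0, 0, 0],
]
patch = [
    [9, 8],
    [7, 6],
]
print(match_template_sqdiff(page, patch))  # prints (1, 2, 0)
```

The returned row and column are in pixels of the rendered page, so they still have to be divided by the rendering resolution to get a position in page units.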

(* - link removed as it now appears to be pointing towards a ransomware site)

Eric Fortis answered Sep 17 '22 20:09


There's an -xml switch for the pdftohtml command which will give image position, dimension and source information.

pdftohtml -xml file.pdf
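
The XML that this produces can then be read with a few lines of Python. The sketch below assumes the output contains `<page number="...">` elements holding `<image>` elements with `top`, `left`, `width`, `height` and `src` attributes; check this against what your pdftohtml version actually writes. The sample document and the helper name `image_positions` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape assumed above, NOT real tool output.
SAMPLE_XML = """
<pdf2xml>
  <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <image top="100" left="50" width="200" height="150" src="file-1_1.png"/>
    <text top="300" left="50" width="80" height="16" font="0">Hello</text>
  </page>
  <page number="2" position="absolute" top="0" left="0" height="1188" width="918">
    <image top="20" left="30" width="40" height="60" src="file-2_1.png"/>
  </page>
</pdf2xml>
"""


def image_positions(xml_text):
    """Return one dict per <image> element, tagged with its page number."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        for img in page.iter("image"):
            found.append({
                "page": int(page.get("number")),
                "src": img.get("src"),
                "left": float(img.get("left")),
                "top": float(img.get("top")),
                "width": float(img.get("width")),
                "height": float(img.get("height")),
            })
    return found


for entry in image_positions(SAMPLE_XML):
    print(entry)
```

For a real file you would pass the contents of the generated .xml file instead of `SAMPLE_XML`.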
someuser9809 answered Sep 16 '22 20:09


PDF gives no guarantee that an image used more than once is stored only once; each use may be a separate image object. There is also very little image metadata in a PDF file beyond the image's location on the page and its actual size on the page. I wrote an article explaining how images are stored inside a PDF at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/
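
As a rough illustration of where that location and size come from: an image is painted into a unit square, and the current transformation matrix (set with the `cm` operator) scales and moves that square onto the page before `Do` draws it. The sketch below walks a heavily simplified, already-decoded content stream and handles only `q`, `Q`, `cm` and `Do`. The function names and the sample stream are made up, a real parser must also cope with strings, arrays, inline images and compressed streams, and `Do` can draw form XObjects as well as images.

```python
def concat(m1, m2):
    """Combine two PDF matrices [a b c d e f]: apply m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def xobject_boxes(content):
    """Return (name, x_min, y_min, x_max, y_max) for every `Do` operator.

    Coordinates are in PDF user space (origin at the bottom-left corner).
    """
    ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)  # identity matrix
    saved = []      # graphics-state stack for q / Q
    operands = []   # tokens seen since the last handled operator
    boxes = []
    for token in content.split():
        if token == "q":
            saved.append(ctm)
            operands.clear()
        elif token == "Q":
            if saved:
                ctm = saved.pop()
            operands.clear()
        elif token == "cm":
            matrix = tuple(float(v) for v in operands[-6:])
            ctm = concat(matrix, ctm)
            operands.clear()
        elif token == "Do":
            name = operands[-1].lstrip("/")
            a, b, c, d, e, f = ctm
            # Map the four corners of the unit square through the CTM.
            corners = [(a * x + c * y + e, b * x + d * y + f)
                       for x, y in ((0, 0), (1, 0), (0, 1), (1, 1))]
            xs = [p[0] for p in corners]
            ys = [p[1] for p in corners]
            boxes.append((name, min(xs), min(ys), max(xs), max(ys)))
            operands.clear()
        else:
            operands.append(token)
    return boxes


# Invented stream: move to (50, 600), then scale to 200 x 150 and draw Im0.
stream = "q 1 0 0 1 50 600 cm q 200 0 0 150 0 0 cm /Im0 Do Q Q"
print(xobject_boxes(stream))  # prints [('Im0', 50.0, 600.0, 250.0, 750.0)]
```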

mark stephens answered Sep 18 '22 20:09