I have a PDF that contains text and images. I want to extract the images from the PDF on the Linux command line. I can use pdfimages to extract the images, but I also want to find where on each page each image is placed. pdfimages can tell me which page each image is on (from the filename), but that is all it gives me. Is there any other FLOSS tool that can do this?
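The PDF must contain the placement information for its images, so this should be possible. Alternatively, a workaround could be, e.g.:
1. pdftoppm (render each page to an image)
2. pdfimages (extract the embedded images)
3. cvCvtColor (convert both to the same colour space)
4. matchTemplate (locate each extracted image on its rendered page)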
Step 1 may look similar to this.
Step 2:
# Extract each page's images separately; pdfimages page numbers start at 1 (100 pages assumed)
for i in {1..100}; do pdfimages -f "$i" -l "$i" file.pdf "page$i"; done
Step 3: a simple example was linked here*.
In Step 4 you should not need any training, because the image will be an exact match. The rendered page is searched for the extracted image:
matchTemplate(pdfPageImg, imageToSearch, outputMap, CV_TM_SQDIFF)
(* - link removed as it now appears to be pointing towards a ransomware site)
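To make Step 4 more concrete, here is a minimal pure-Python sketch of what squared-difference template matching computes. It is not the OpenCV implementation; in practice cv2.matchTemplate with cv2.TM_SQDIFF followed by cv2.minMaxLoc does the same job far faster. The function name match_template_sqdiff and the tiny grids are made up for illustration. The sketch also assumes both images are grayscale and that the page was rendered at a resolution where the embedded image keeps its original pixel size.

```python
def match_template_sqdiff(page, templ):
    """Return (x, y, score) of the best match of templ inside page.

    page and templ are grayscale images given as lists of rows of ints.
    score is the sum of squared differences; 0 means an exact match.
    Returns None if templ is larger than page.
    """
    page_h, page_w = len(page), len(page[0])
    templ_h, templ_w = len(templ), len(templ[0])
    best = None
    # Slide the template over every position where it fits entirely.
    for y in range(page_h - templ_h + 1):
        for x in range(page_w - templ_w + 1):
            score = sum(
                (page[y + dy][x + dx] - templ[dy][dx]) ** 2
                for dy in range(templ_h)
                for dx in range(templ_w)
            )
            # Keep the first position with the lowest score seen so far.
            if best is None or score < best[2]:
                best = (x, y, score)
    return best


# Hypothetical 4x5 "page" with a 2x2 "image" placed at x=2, y=1.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 7, 0],
    [0, 0, 5, 3, 0],
    [0, 0, 0, 0, 0],
]
image = [
    [9, 7],
    [5, 3],
]
print(match_template_sqdiff(page, image))  # (2, 1, 0)
```

The returned x and y are the top-left corner of the image on the rendered page, in pixels; dividing by the render resolution would convert them to page units.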
The pdftohtml command has an -xml switch, which outputs each image's position, dimensions and source file.
pdftohtml -xml file.pdf
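Once the XML exists, the image entries can be pulled out with a few lines of Python. This is a sketch under assumptions: the sample string below is a hand-written approximation of pdftohtml's XML (page elements containing image elements with top, left, width, height and src attributes), the attribute values are assumed to be integers, and image_positions is a made-up helper name. Check the layout against your own output before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample resembling pdftohtml -xml output; real files carry
# more elements (fonts, text) and attributes than shown here.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="108" width="200" height="16" font="0">Some text</text>
<image top="150" left="108" width="300" height="200" src="file-1_1.png"/>
</page>
<page number="2" position="absolute" top="0" left="0" height="1188" width="918">
<image top="90" left="400" width="120" height="80" src="file-2_1.png"/>
</page>
</pdf2xml>"""


def image_positions(xml_text):
    """Return one dict per <image> element, tagged with its page number."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        for img in page.iter("image"):
            found.append({
                "page": int(page.get("number")),
                "src": img.get("src"),
                "left": int(img.get("left")),
                "top": int(img.get("top")),
                "width": int(img.get("width")),
                "height": int(img.get("height")),
            })
    return found


for entry in image_positions(SAMPLE_XML):
    print(entry)
```

For a real file, reading the generated XML from disk and passing its text to image_positions should work the same way. The left and top values are in the same units as the width and height attributes on the enclosing page element.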
PDF does not guarantee that a reused image is stored only once; each use may be a separate image. There is very little image metadata in a PDF file beyond the image's location on the page and its actual size on the page. I wrote an article explaining how images are stored inside a PDF at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/