
Extract images from a PDF using PDFBox

I'm trying to extract images from a PDF using PDFBox. The example PDF is here.

But I'm getting only blank images.

The code I'm trying:

public static void main(String[] args) {
    PDFImageExtract obj = new PDFImageExtract();
    try {
        obj.read_pdf();
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
}

void read_pdf() throws IOException {
    PDDocument document = null;
    try {
        document = PDDocument.load("C:\\Users\\Pradyut\\Documents\\MCS-034.pdf");
    } catch (IOException ex) {
        System.out.println("" + ex);
    }
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    int i = 1;
    String name = null;

    while (iter.hasNext()) {
        PDPage page = (PDPage) iter.next();
        PDResources resources = page.getResources();
        Map pageImages = resources.getImages();
        if (pageImages != null) {
            Iterator imageIter = pageImages.keySet().iterator();
            while (imageIter.hasNext()) {
                String key = (String) imageIter.next();
                PDXObjectImage image = (PDXObjectImage) pageImages.get(key);
                image.write2file("C:\\Users\\Pradyut\\Documents\\image" + i);
                i++;
            }
        }
    }
}

Thanks

Pradyut Bhattacharya asked Jan 02 '12 20:01




2 Answers

Here is code using PDFBox 2.0.1 that returns a list of all images in the PDF. Unlike the other code, it recurses into form XObjects instead of looking only at the images in each page's top-level resources.

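import java.awt.image.RenderedImage;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Collects the images of every page in the document.
public List<RenderedImage> getImagesFromPDF(PDDocument document) throws IOException {
    List<RenderedImage> images = new ArrayList<>();
    for (PDPage page : document.getPages()) {
        images.addAll(getImagesFromResources(page.getResources()));
    }
    return images;
}

// Walks one resource dictionary, descending into form XObjects,
// which carry their own resources and may contain further images.
private List<RenderedImage> getImagesFromResources(PDResources resources) throws IOException {
    List<RenderedImage> images = new ArrayList<>();

    for (COSName xObjectName : resources.getXObjectNames()) {
        PDXObject xObject = resources.getXObject(xObjectName);

        if (xObject instanceof PDFormXObject) {
            images.addAll(getImagesFromResources(((PDFormXObject) xObject).getResources()));
        } else if (xObject instanceof PDImageXObject) {
            images.add(((PDImageXObject) xObject).getImage());
        }
    }

    return images;
}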
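The method above returns the images in memory but does not write them anywhere. Below is a minimal sketch of saving them as PNG files with the standard javax.imageio.ImageIO API. The class name ImageSaver, the method saveAll and the file-name pattern are hypothetical names chosen for illustration; they are not part of PDFBox or of the original answer.

```java
import java.awt.image.RenderedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of PDFBox): writes each image to <dir>/<prefix>_<n>.png.
public class ImageSaver {

    public static List<Path> saveAll(List<? extends RenderedImage> images, Path dir, String prefix)
            throws IOException {
        Files.createDirectories(dir); // does nothing if the directory already exists
        List<Path> written = new ArrayList<>();
        int index = 1;
        for (RenderedImage image : images) {
            Path target = dir.resolve(prefix + "_" + index + ".png");
            // ImageIO.write returns false when no suitable writer is found.
            if (!ImageIO.write(image, "png", target.toFile())) {
                throw new IOException("No PNG writer available for image " + index);
            }
            written.add(target);
            index++;
        }
        return written;
    }
}
```

Assuming an output folder named "out", it could be combined with the method above as ImageSaver.saveAll(getImagesFromPDF(document), Paths.get("out"), "image").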
Matt answered Sep 24 '22 23:09


The GetImagesFromPDF Java class below reads all images in the file 04-Request-Headers.pdf and saves them to the destination folder PDFCopy. It is written against the PDFBox 1.x API.

import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

@SuppressWarnings({ "unchecked", "rawtypes", "deprecation" })
public class GetImagesFromPDF {
    public static void main(String[] args) {
        try {
            // Put the PDF file to read in the PDFCopy folder
            String sourceDir = "C:/PDFCopy/04-Request-Headers.pdf";
            String destinationDir = "C:/PDFCopy/";
            File oldFile = new File(sourceDir);
            if (oldFile.exists()) {
                PDDocument document = PDDocument.load(sourceDir);

                List<PDPage> list = document.getDocumentCatalog().getAllPages();

                String fileName = oldFile.getName().replace(".pdf", "_cover");
                int totalImages = 1;
                for (PDPage page : list) {
                    PDResources pdResources = page.getResources();

                    Map pageImages = pdResources.getImages();
                    if (pageImages != null) {
                        Iterator imageIter = pageImages.keySet().iterator();
                        while (imageIter.hasNext()) {
                            String key = (String) imageIter.next();
                            PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
                            // write2file appends the image's own file suffix to the given prefix
                            pdxObjectImage.write2file(destinationDir + fileName + "_" + totalImages);
                            totalImages++;
                        }
                    }
                }
                document.close(); // release the underlying file
            } else {
                System.err.println("File does not exist");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
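One detail of the class above: the output prefix comes from oldFile.getName().replace(".pdf", "_cover"), which is case-sensitive and replaces every occurrence of ".pdf" in the name, not just the extension. Below is a sketch of a stricter alternative that strips only a trailing extension. The class OutputNames and the method prefixFor are hypothetical names introduced for illustration.

```java
// Hypothetical helper: builds an output prefix from a PDF file name.
public class OutputNames {

    public static String prefixFor(String pdfFileName, String suffix) {
        int extStart = pdfFileName.length() - 4;
        // Case-insensitive check that the name ends with ".pdf";
        // regionMatches returns false when extStart is negative.
        boolean hasPdfExtension = pdfFileName.regionMatches(true, extStart, ".pdf", 0, 4);
        String base = hasPdfExtension ? pdfFileName.substring(0, extStart) : pdfFileName;
        return base + suffix;
    }
}
```

For example, prefixFor("04-Request-Headers.pdf", "_cover") gives "04-Request-Headers_cover", and a name such as "REPORT.PDF" is handled the same way.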

UdayKiran Pulipati answered Sep 24 '22 23:09