
Images extracted from PDF are horizontally fragmented

I have to extract images from corporate PDF files that contain technical drawings. The PDF files conform to a PDF/A format.

I'm using an approach with Apache's pdfbox, which I learned from this question.

import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {

    /**
     * Extracts every image listed in the page resources of a PDF (PDFBox 1.x API).
     *
     * @param filename pdf file
     * @param res folder, where images are extracted (used as a file name prefix)
     * @throws IOException if the PDF cannot be read or an image cannot be written
     */
    public static void extractImages(String filename, String res) throws IOException {
        int pageNo = 0;
        PDDocument document = PDDocument.load(filename);
        try {
            @SuppressWarnings("unchecked")
            List<PDPage> pages = document.getDocumentCatalog().getAllPages();
            for (PDPage page : pages) {
                pageNo++;
                PDResources resources = page.getResources();
                Map<String, PDXObjectImage> pageImages = resources.getImages();
                if (pageImages != null) {
                    for (Map.Entry<String, PDXObjectImage> entry : pageImages.entrySet()) {
                        // write2file appends a suffix matching the image type
                        entry.getValue().write2file(res + "_page_" + pageNo + "_" + entry.getKey());
                    }
                }
            }
        } finally {
            // close the document even if extraction fails part-way
            document.close();
        }
    }
}

My problem now is that for some files the extracted images are horizontally fragmented into up to 3 slices. Since I don't want to splice them together manually, I would be glad if someone had some advice.

EDIT - APPROACH 1

One solution I thought of was to create folders per image, then put all the fragments in their corresponding folders, iterate over the folders and merge the content. That would require some sorting work on my side, but I think it could work.

String key = (String) imageIter.next();

returns Im<number>, where the number denotes the order of the images on the page. So the fragments in each folder would already be ordered, and the merging program could easily figure out which part goes on top, etc.
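The ordering step could be sketched roughly as follows. This is only an illustration: FragmentNames, parse and sortFragments are hypothetical names, and the file name layout <prefix>_page_<page>_Im<number> with an optional suffix is assumed from the extraction code above. The point of parsing the numbers is that a plain string sort would put Im10 before Im2.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: orders extracted fragment files by page number, then by Im<number>. */
final class FragmentNames {

    // Assumed naming scheme: <prefix>_page_<page>_Im<number>[.<suffix>]
    private static final Pattern NAME =
            Pattern.compile(".*_page_(\\d+)_Im(\\d+)(\\.\\w+)?");

    /** Returns {page, imageNumber} parsed from a file name, or null if the name does not match. */
    static int[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** Sorts matching names numerically; names that do not match the scheme are dropped. */
    static List<String> sortFragments(List<String> fileNames) {
        List<String> matching = new ArrayList<String>();
        for (String name : fileNames) {
            if (parse(name) != null) {
                matching.add(name);
            }
        }
        Collections.sort(matching, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                int[] pa = parse(a);
                int[] pb = parse(b);
                if (pa[0] != pb[0]) {
                    return Integer.compare(pa[0], pb[0]); // page first
                }
                return Integer.compare(pa[1], pb[1]);     // then image number
            }
        });
        return matching;
    }
}
```

Whether the Im<number> order really corresponds to top-to-bottom placement on the page is an assumption that should be checked against a few sample files.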

EDIT - APPROACH 2

Another approach I could think of: the fragments carry their order in their file names, which follow the pattern pdfname_page_\d+_Im\d+\.(tiff|png). So I could sort the images by that order and then merge all consecutive fragments that have the same width. I checked the fragments, and it seems that nearly all images have different dimensions.
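A minimal sketch of the merging step in approach 2, using only java.awt from the JDK. FragmentStitcher, stackVertically and mergeRuns are hypothetical names. It assumes the images are already sorted in extraction order, that the first fragment of a run is the top slice, and that equal width really means "same picture". Two unrelated images that happen to share a width would be merged incorrectly.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: merges consecutive equal-width slices into single images. */
final class FragmentStitcher {

    /** Stacks the given slices top to bottom; all slices must share the same width. */
    static BufferedImage stackVertically(List<BufferedImage> slices) {
        int width = slices.get(0).getWidth();
        int height = 0;
        for (BufferedImage slice : slices) {
            if (slice.getWidth() != width) {
                throw new IllegalArgumentException("slices differ in width");
            }
            height += slice.getHeight();
        }
        // TYPE_INT_RGB drops any alpha channel; acceptable for this sketch
        BufferedImage merged = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = merged.createGraphics();
        try {
            int y = 0;
            for (BufferedImage slice : slices) {
                g.drawImage(slice, 0, y, null);
                y += slice.getHeight();
            }
        } finally {
            g.dispose();
        }
        return merged;
    }

    /** Merges each run of consecutive equal-width images; single images pass through unchanged. */
    static List<BufferedImage> mergeRuns(List<BufferedImage> ordered) {
        List<BufferedImage> result = new ArrayList<BufferedImage>();
        List<BufferedImage> run = new ArrayList<BufferedImage>();
        for (BufferedImage img : ordered) {
            if (!run.isEmpty() && run.get(0).getWidth() != img.getWidth()) {
                result.add(run.size() == 1 ? run.get(0) : stackVertically(run));
                run = new ArrayList<BufferedImage>();
            }
            run.add(img);
        }
        if (!run.isEmpty()) {
            result.add(run.size() == 1 ? run.get(0) : stackVertically(run));
        }
        return result;
    }
}
```

The slices could be loaded with javax.imageio.ImageIO.read and the results saved with ImageIO.write. Note that reading TIFF files through ImageIO needs Java 9 or later, or an additional plugin.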

What do you say to these approaches?

EDIT3

Since we ran out of time, my colleague and I had to extract the images by hand. I'm still interested, but I'll have to solve this problem in my free time.

mike asked Nov 08 '12 14:11



1 Answer

The extracted images are fragmented into 3 slices because the embedded images are fragmented too. Most likely the PDF-generating software did this automatically. (It is very rare that, say, an InDesign document designer would do this on purpose.)

Hence, there is no reliable method which you could use to automatically stitch together the fragments.

What you can try is this -- but only if you have a version of Adobe Acrobat (Pro?) available:

  • Use the built-in "PDF Optimizer".
  • In the "Delete Objects" panel, activate the "Detect image fragments and merge them" option.

(Sorry, I translated the menu and UI entries above from memory of a German Acrobat Pro installation, so they almost certainly don't match the English UI exactly.)

However, this method will, in my experience, not work very reliably. In most cases of image fragmentation in PDFs it will not work at all. :-(

Kurt Pfeifle answered Oct 01 '22 00:10