Images extracted from PDF are horizontally fragmented

I have to extract images from corporate PDF files that contain technical drawings. The PDF files conform to a PDF/A format.

I'm using an approach based on Apache PDFBox, which I learned from this question.

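import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {

    /**
     * Writes every image found in the page resources to disk (PDFBox 1.x API).
     *
     * @param filename pdf file
     * @param res folder, where images are extracted (used as the file name prefix)
     * @throws IOException if the PDF cannot be read or an image cannot be written
     */
    public static void extractImages(String filename, String res) throws IOException {
        PDDocument document = PDDocument.load(filename);
        try {
            @SuppressWarnings("unchecked") // getAllPages() returns a raw List in PDFBox 1.x
            List<PDPage> pages = document.getDocumentCatalog().getAllPages();
            int pageNo = 0;
            for (PDPage page : pages) {
                pageNo++;
                PDResources resources = page.getResources();
                if (resources == null) {
                    continue; // page without a resource dictionary
                }
                Map<String, PDXObjectImage> pageImages = resources.getImages();
                if (pageImages == null) {
                    continue; // page without images
                }
                for (Map.Entry<String, PDXObjectImage> entry : pageImages.entrySet()) {
                    // write2file appends a suffix matching the image type
                    entry.getValue().write2file(res + "_page_" + pageNo + "_" + entry.getKey());
                }
            }
        } finally {
            // close the document even if extraction fails
            document.close();
        }
    }
}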

My problem is that, for some files, the extracted images are horizontally fragmented into up to 3 slices. Since I don't want to splice them together manually, I would be glad if someone had some advice.

EDIT - APPROACH 1

One solution I thought of was to create a folder per image, put all the fragments into their corresponding folders, then iterate over the folders and merge the contents. That would require some sorting work on my side, but I think it could work.

String key = (String) imageIter.next();

returns Im<number>, where number denotes the order of the images per page. So the fragments in each folder would already be in order, and the merging program could easily figure out which part goes on top, and so on.
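A minimal sketch of the sorting step, assuming every key really has the form Im<number> and that the number reflects the top-to-bottom order of the slices (which the PDF itself does not guarantee). The class and method names (FragmentOrder, imageNumber, sortKeys) are made up for illustration. A numeric comparison is needed because a plain string sort would put Im10 before Im2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FragmentOrder {

    // Extracts the numeric part of a resource key such as "Im12".
    // Assumption: the key contains at least one digit; otherwise this throws NumberFormatException.
    static int imageNumber(String key) {
        return Integer.parseInt(key.replaceAll("\\D", ""));
    }

    // Returns a copy of the keys sorted by their numeric part.
    static List<String> sortKeys(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingInt(FragmentOrder::imageNumber));
        return sorted;
    }

    public static void main(String[] args) {
        // prints [Im1, Im2, Im10]
        System.out.println(sortKeys(Arrays.asList("Im10", "Im2", "Im1")));
    }
}
```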

EDIT - APPROACH 2

Another approach I could think of: the fragments carry their order in their file names, following the pattern pdfname_page_\d+_Im\d+\.(tiff|png). So I could sort the images by that order and then merge all consecutive fragments that have the same width. I checked the fragments, and it seems that nearly all images have different dimensions.
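The merging step of this approach could look roughly like the sketch below. It uses only the JDK imaging classes and assumes the images have already been loaded (for example with ImageIO.read) and sorted top to bottom. FragmentMerger, stack and mergeRuns are hypothetical names. The equal-width test is only a heuristic: two unrelated neighbouring images that happen to share a width would be merged by mistake.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

public class FragmentMerger {

    // Stacks the slices top to bottom into one image; all slices must share one width.
    static BufferedImage stack(List<BufferedImage> slices) {
        int width = slices.get(0).getWidth();
        int height = 0;
        for (BufferedImage slice : slices) {
            height += slice.getHeight();
        }
        // TYPE_INT_RGB drops any alpha channel; fine for scanned drawings, an assumption here.
        BufferedImage merged = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = merged.createGraphics();
        int y = 0;
        for (BufferedImage slice : slices) {
            g.drawImage(slice, 0, y, null);
            y += slice.getHeight();
        }
        g.dispose();
        return merged;
    }

    // Merges each run of consecutive equal-width images into a single image.
    static List<BufferedImage> mergeRuns(List<BufferedImage> ordered) {
        List<BufferedImage> result = new ArrayList<>();
        List<BufferedImage> run = new ArrayList<>();
        for (BufferedImage img : ordered) {
            if (!run.isEmpty() && run.get(0).getWidth() != img.getWidth()) {
                result.add(stack(run)); // width changed: the current run is complete
                run = new ArrayList<>();
            }
            run.add(img);
        }
        if (!run.isEmpty()) {
            result.add(stack(run));
        }
        return result;
    }
}
```

Each merged image can then be written back to disk with ImageIO.write. Images that are not fragmented simply form a run of length one and pass through unchanged in size.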

What do you think of these approaches?

EDIT3

Since we ran out of time, my colleague and I had to extract the images by hand. I'm still interested in a solution, but I'll have to work on this problem in my free time.

asked Nov 08 '12 14:11

mike

1 Answer

The extracted images are fragmented into 3 slices because the embedded images are, too. Most likely the PDF-generating software did this automatically. (It is very rare that, say, an InDesign document designer does this on purpose.)

Hence, there is no reliable method you could use to stitch the fragments together automatically.

What you can try is this -- but only if you have a version of Adobe Acrobat (Pro?) available:

Use the built-in "PDF Optimizer".
In the "Delete Objects" panel, activate the "Detect image fragments and merge them" option.

(Sorry, I translated the menu and UI entries above from memory of a German Acrobat Pro installation, so they almost certainly don't match the English UI exactly.)

However, in my experience this method does not work very reliably. In most cases of image fragmentation in PDFs it will not work at all. :-(

answered Oct 01 '22 00:10

Kurt Pfeifle

Tags: java, image, pdf, extract
