
"black stain" when extracting page to image on PDFBox 2.0.4

I am using PDFBox 2.0.4 to render PDF pages as images, and the rendered page contains several "black holes", as shown in this screenshot:

[screenshot of the rendered page]

This happens only with this PDF and a few others: http://www.filedropper.com/selection_3

Here is a minimal JavaFX program that reproduces the problem (change the file path after downloading the PDF):

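import java.awt.image.BufferedImage;
import java.io.File;

import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {

    @Override
    public void start(Stage primaryStage) throws Exception {
        Image fxImage;
        // Close the document once the page has been rendered.
        try (PDDocument document = PDDocument.load(new File("C:\\Users\\John\\Desktop\\selection.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indices are zero-based, so 1 renders the second page.
            BufferedImage bufferedImage = pdfRenderer.renderImage(1);
            fxImage = SwingFXUtils.toFXImage(bufferedImage, null);
        }

        BorderPane borderPane = new BorderPane();
        borderPane.setCenter(new ImageView(fxImage));

        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}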

Here are my dependencies:

  • pdfbox 2.0.4
  • jai-imageio-jpeg2000 1.3.0 (prevents the error "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed")
  • levigo-jbig2-imageio 1.6.5 (prevents the error "Cannot read JBIG2 image: jbig2-imageio is not installed")

The logs contain the following warnings, but I don't know whether they are the cause of the problem. How can I fix it?

févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Helvetica
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Helvetica

Did I miss something in the code, or should I report a bug?

Asked by Rizen, Feb 01 '17 10:02



1 Answer

This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No Unicode mapping..." warnings are irrelevant here; they matter only for text extraction.

Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project, then either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or put the built jar file on your classpath.

If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, you will no longer have to build it yourself.
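
After swapping in a self-built jar, it can be useful to confirm which JPEG2000 reader ImageIO actually picks up, since an old 1.3.0 jar left on the classpath could still shadow the patched one. The following is only a rough sketch using the standard javax.imageio API: the class name Jpeg2000ReaderCheck and the helper readerLocations are made up for this illustration, and the format name "JPEG2000" is assumed to be one of the names the jai-imageio-jpeg2000 plugin registers.

```java
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

public class Jpeg2000ReaderCheck {

    // Hypothetical helper: lists every ImageIO reader registered for the given
    // format name, together with the location it was loaded from (if known).
    static List<String> readerLocations(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            Class<?> readerClass = readers.next().getClass();
            // The code source can be null for readers bundled with the JDK.
            CodeSource source = readerClass.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            result.add(readerClass.getName() + " from "
                    + (location == null ? "<unknown location>" : location.toString()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Pick up plugins that were added to the classpath.
        ImageIO.scanForPlugins();

        // Assumption: "JPEG2000" is a format name registered by the plugin.
        List<String> found = readerLocations("JPEG2000");
        if (found.isEmpty()) {
            System.out.println("No JPEG2000 reader is registered.");
        }
        for (String entry : found) {
            System.out.println(entry);
        }
    }
}
```

If the printed location points at the 1.3.0 jar rather than the rebuilt one, the old dependency is still being resolved first.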

Additional keywords: black inkblot, black splodge

Answered by Tilman Hausherr, Nov 14 '22 22:11