When I use PDFBox 2.0.4 to render pages as images, the resulting page contains several "black holes", as shown in the following screenshot:
This happens only with this PDF and a few others: http://www.filedropper.com/selection_3
Here is a simple example (using JavaFX) that reproduces the problem (change the file path after downloading the PDF):
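import java.awt.image.BufferedImage;
import java.io.File;
import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {
    @Override
    public void start(Stage primaryStage) throws Exception {
        BufferedImage bufferedImage;
        // Load the PDF and close it once the page has been rendered
        try (PDDocument document = PDDocument.load(new File("C:\\Users\\John\\Desktop\\selection.pdf"))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indexes are 0-based, so this renders the second page
            bufferedImage = pdfRenderer.renderImage(1);
        }
        Image fxImage = SwingFXUtils.toFXImage(bufferedImage, null);
        BorderPane borderPane = new BorderPane();
        ImageView imageView = new ImageView(fxImage);
        borderPane.setCenter(imageView);
        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}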
Here are my dependencies:
The logs contain the following warnings, but I don't know whether they are the cause of the problem. How can I fix it?
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Times-Bold
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
AVERTISSEMENT: No Unicode mapping for .notdef (9) in font Helvetica
févr. 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
AVERTISSEMENT: No glyph for 9 (.notdef) in font Helvetica
Did I miss something in the code, or should I report a bug?
This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No Unicode mapping..." warnings are irrelevant here; they only matter for text extraction.
Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project and either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or copy the jar file into your classpath.
If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, then you'll no longer have to build it yourself.
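To check which JPEG 2000 decoder is actually picked up at runtime after swapping the jar, a small sketch like the one below may help. It uses only standard javax.imageio calls to list the readers registered for JPEG 2000 and where each class was loaded from. The class name Jpeg2000ReaderCheck and the method jpeg2000ReaderInfo are made up for illustration, and the format name "JPEG2000" is an assumption about how the plugin registers itself.

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

// Hypothetical helper, not part of PDFBox or jai-imageio.
class Jpeg2000ReaderCheck {

    // Returns one entry per registered JPEG 2000 reader: "<class name> from <jar location>".
    static List<String> jpeg2000ReaderInfo() {
        List<String> info = new ArrayList<>();
        // Assumption: the plugin registers itself under the format name "JPEG2000".
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName("JPEG2000");
        while (readers.hasNext()) {
            Class<?> readerClass = readers.next().getClass();
            // The code source can be null for classes loaded by the bootstrap loader.
            CodeSource source = readerClass.getProtectionDomain().getCodeSource();
            String location = (source == null) ? "unknown" : String.valueOf(source.getLocation());
            info.add(readerClass.getName() + " from " + location);
        }
        return info;
    }

    public static void main(String[] args) {
        List<String> info = jpeg2000ReaderInfo();
        if (info.isEmpty()) {
            System.out.println("No JPEG 2000 ImageIO reader found on the classpath");
        }
        for (String line : info) {
            System.out.println(line);
        }
    }
}
```

If the list is empty, no JPEG 2000 plugin is on the classpath. If the printed location points to a jar other than the one you built, the unpatched decoder is probably still in use.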
Additional keywords: black inkblot, black splodge