"black stain" when extracting page to image on PDFBox 2.0.4

When I use PDFBox 2.0.4 to render pages as images, the resulting image contains several "black holes", as shown in the following screenshot:

[Screenshot: rendered page with black patches]

This happens only with this PDF and a few others: http://www.filedropper.com/selection_3

Here is a minimal example (using JavaFX) that reproduces the problem; change the file path after downloading the PDF:

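import java.awt.image.BufferedImage;
import java.io.File;
import javafx.application.Application;
import javafx.embed.swing.SwingFXUtils;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.BorderPane;
import javafx.stage.Stage;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PDFExtractionTest extends Application {

    @Override
    public void start(Stage primaryStage) throws Exception {
        File file = new File("C:\\Users\\John\\Desktop\\selection.pdf");
        BufferedImage bufferedImage;
        // try-with-resources closes the document once the page is rendered
        try (PDDocument document = PDDocument.load(file)) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            // Page indexes are 0-based, so 1 renders the second page
            bufferedImage = pdfRenderer.renderImage(1);
        }
        // Convert the AWT image so JavaFX can display it
        Image fxImage = SwingFXUtils.toFXImage(bufferedImage, null);

        BorderPane borderPane = new BorderPane();
        ImageView imageView = new ImageView(fxImage);

        borderPane.setCenter(imageView);

        primaryStage.setScene(new Scene(borderPane, 1024, 768));
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}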

Here are my dependencies:

pdfbox 2.0.4
jai-imageio-jpeg2000 1.3.0 (prevents the error "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed")
levigo-jbig2-imageio 1.6.5 (prevents the error "Cannot read JBIG2 image: jbig2-imageio is not installed")

The logs contain the following warnings, but I don't know whether they are the cause of the problem. How can I fix it?

Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Times-Bold
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for .notdef (9) in font Helvetica
Feb 01, 2017 11:20:51 AM org.apache.pdfbox.rendering.Type1Glyph2D getPathForCharacterCode
WARNING: No glyph for 9 (.notdef) in font Helvetica

Did I miss something in the code, or should I report a bug?

asked Feb 01 '17 10:02

Rizen

1 Answer

This is a long-standing problem (see PDFBOX-1752). The bug is in JAI, not in PDFBox. The "No Unicode mapping..." warnings are unrelated to it; they only matter for text extraction.
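If the warnings clutter the output while you investigate the rendering, they can be filtered out. The following is only a sketch, and it assumes PDFBox's logging ends up in java.util.logging (which the format of the log lines in the question suggests); the class name PdfBoxLogQuieter is made up for this example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfBoxLogQuieter {

    // Keep a strong reference: java.util.logging only holds loggers weakly,
    // so the level could be lost if the logger were garbage collected.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    // Raises the threshold for every logger below org.apache.pdfbox
    // that has no level of its own.
    public static void quietWarnings() {
        PDFBOX_LOGGER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        quietWarnings();
        Logger fontLogger = Logger.getLogger("org.apache.pdfbox.pdmodel.font.PDSimpleFont");
        System.out.println(fontLogger.isLoggable(Level.WARNING)); // prints "false"
        System.out.println(fontLogger.isLoggable(Level.SEVERE));  // prints "true"
    }
}
```

Call quietWarnings() before loading the document. This only hides the messages; it has no effect on the black patches.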

Check out the jai-imageio-jpeg2000 project, then change the file StdEntropyDecoder.java as in this commit (expanded from this pull request). Build the project, then either reference version 1.3.1-SNAPSHOT in your Maven pom.xml or copy the jar file into your classpath.

If the jai-imageio-jpeg2000 project team releases a new version that contains that pull request, you will no longer have to build it yourself.
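After swapping jars it is easy to end up with the old 1.3.0 build still on the classpath. A rough way to check which jar supplies the JPEG2000 reader is to ask ImageIO for the registered readers and print where each class was loaded from. This is only a sketch: the class name Jpeg2000ReaderCheck and the helper describeReaders are made up for this example, and it assumes the plugin registers under the format name "JPEG2000".

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

public class Jpeg2000ReaderCheck {

    // Lists every ImageIO reader registered for the given format name,
    // together with the location (usually a jar) its class was loaded from.
    static List<String> describeReaders(String formatName) {
        List<String> result = new ArrayList<>();
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        while (readers.hasNext()) {
            Class<?> type = readers.next().getClass();
            CodeSource source = type.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "(JDK or unknown location)"
                    : source.getLocation().toString();
            result.add(type.getName() + " from " + location);
        }
        return result;
    }

    public static void main(String[] args) {
        // Pick up plugins that were added to the classpath
        ImageIO.scanForPlugins();
        List<String> readers = describeReaders("JPEG2000");
        if (readers.isEmpty()) {
            System.out.println("No JPEG2000 reader registered");
        }
        readers.forEach(System.out::println);
    }
}
```

If the printed location still points at the 1.3.0 jar, the rebuilt jar is not the one being used.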

Additional keywords: black inkblot, black splodge

answered Nov 14 '22 22:11

Tilman Hausherr

Tags: java, pdf, pdfbox, jpeg2000