Is it possible to redact PDF areas with PDFBox by position?

The Context

Currently, I have a solution where I loop through a PDF and draw black rectangles throughout it.

So I already have a list of PDRectangle objects representing the areas I need to fill/cover on the PDF, hiding all the text I want to hide.

The Problems

Problem number 1: The text underneath the black rectangle can still easily be copied, searched, or extracted by other tools.

I solved this by flattening my PDF (converting each page into an image so that it becomes a single-layer document and the black rectangle can no longer be bypassed). Same solution as described here: Disable pdf-text searching with pdfBox

This is not actual redaction; it's more of a workaround. Which leads me to

Problem number 2:

My final PDF becomes an image-only document, so I lose all the PDF properties, including searching and copying. It's also a much slower process. I want to keep all the PDF properties while making sure the redacted areas cannot be read by any means.

What I want to accomplish

That being said, I'd like to know whether it is possible, and how, to do an actual redaction with PDFBox: black out rectangular areas (I already have all the positions I need) while keeping the PDF properties and making the redacted areas unreadable.

Note: I'm aware of the problems PDFBox had with the old ReplaceText function, but here I have the positions I need to make sure I'd blank precisely the areas I need.

I'm also open to suggestions for other free libraries.

Technical Specification:

PDFBox 2.0.21
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB RAM, x86_64

My Code

This is how I draw the black rectangle:

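import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

// Draws an opaque black rectangle over the given area.
// This only hides the text visually; the text operators stay in the content stream.
private void draw(PDPage page, PDRectangle hitPdRectangle) throws IOException {
    // try-with-resources closes the stream even if drawing fails.
    // resetContext = true wraps the existing content in save/restore so that
    // earlier transformations cannot displace the rectangle.
    try (PDPageContentStream content = new PDPageContentStream(pdDocument, page,
            PDPageContentStream.AppendMode.APPEND, false, true)) {
        content.setNonStrokingColor(0f);
        content.addRect(hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getLowerLeftY() - 0.5f,
            hitPdRectangle.getWidth(),
            hitPdRectangle.getHeight());
        content.fill();
    }
}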

This is how I convert it into an Image PDF:

import java.awt.image.BufferedImage;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.rendering.PDFRenderer;

// Rasterizes every page of pdDocument into a new, image-only document.
private PDDocument createNewRedactedPdf() throws IOException {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
    PDDocument redactedDocument = new PDDocument();

    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
        BufferedImage image = pdfRenderer.renderImageWithDPI(pageIndex, 200);

        // The renderer draws the crop box, so the new page is sized from it
        PDRectangle cropBox = pdDocument.getPage(pageIndex).getCropBox();
        PDPage pageDraw = new PDPage(new PDRectangle(cropBox.getWidth(), cropBox.getHeight()));
        redactedDocument.addPage(pageDraw);

        // Encode the rendered page as a JPEG image XObject (lossy, as before)
        PDImageXObject img = JPEGFactory.createFromImage(redactedDocument, image);

        try (PDPageContentStream contentStream = new PDPageContentStream(redactedDocument, pageDraw,
                PDPageContentStream.AppendMode.OVERWRITE, false)) {
            contentStream.drawImage(img, 0, 0, cropBox.getWidth(), cropBox.getHeight());
        }
    }

    return redactedDocument;
}

Any thoughts?

asked Nov 17 '20 12:11

Thales Valias

1 Answer

What you want, a true redaction feature, can be implemented on top of PDFBox, but it requires a lot of additional code (similar to the pdfSweep add-on implemented on top of iText).

In particular, you have found out yourself that it does not suffice to draw black rectangles over the areas to redact, as text extraction or copy and paste from a viewer usually completely ignores whether text is visible or covered by something.
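A quick way to confirm this for a given page is to run a region-based text extraction over the covered area. The following is a minimal sketch against the PDFBox 2.0 API; the class name `CoveredTextCheck` and the method `textInArea` are made up for illustration, and the coordinate flip assumes an unrotated page whose crop box starts at the origin.

```java
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class CoveredTextCheck {

    // Returns the text a stripper still finds inside `area` on `page`.
    // `area` is given in PDF user space (origin at the bottom left), while
    // the stripper expects a region with the origin at the top left, so the
    // y coordinate is flipped (assumption: unrotated page, crop box at 0,0).
    static String textInArea(PDPage page, PDRectangle area) throws IOException {
        float pageHeight = page.getCropBox().getHeight();
        Rectangle2D region = new Rectangle2D.Float(
            area.getLowerLeftX(),
            pageHeight - area.getUpperRightY(),
            area.getWidth(),
            area.getHeight());

        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.addRegion("covered", region);
        stripper.extractRegions(page);
        return stripper.getTextForRegion("covered");
    }
}
```

If `textInArea` returns anything other than whitespace for a rectangle you have painted over, the text is still in the content stream and has only been hidden visually.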

Thus, in your code you have to find the actual instructions drawing the text to redact and remove them. But you cannot simply remove them without replacement; otherwise, additional text on the same line may be moved by your redaction.

Nor can you simply replace them with the same number of spaces or a move-right by the width of the removed text. Just consider the case of a table from which you want to redact a column with only "yes" and "no" entries. If after redaction a text extractor returns three spaces where there was a "yes" and two spaces where there was a "no", anyone looking at those results knows what there was in the redacted area.
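To make the length leak concrete, here is a small sketch in plain Java (no PDFBox involved). It only models what a text extractor would return; `naiveRedact`, `uniformRedact`, and `guessFromLength` are hypothetical helpers, not part of any library.

```java
import java.util.List;

public class RedactionLeakDemo {

    // Naive approach: replace each redacted value with the same number of spaces.
    static String naiveRedact(String value) {
        return " ".repeat(value.length());
    }

    // Uniform approach: every redacted value becomes the same fixed placeholder,
    // so the extracted text no longer depends on the original value.
    static String uniformRedact(String value) {
        return "[redacted]";
    }

    // What a reader of the extracted text can do: guess from the length alone.
    static String guessFromLength(String redacted) {
        if (redacted.length() == 3) {
            return "yes";
        }
        if (redacted.length() == 2) {
            return "no";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        List<String> column = List.of("yes", "no", "no", "yes");
        for (String cell : column) {
            System.out.println("naive -> " + guessFromLength(naiveRedact(cell))
                + ", uniform -> " + guessFromLength(uniformRedact(cell)));
        }
    }
}
```

With the naive replacement every original value is recovered from its length; with the uniform placeholder the guess is always "unknown". In an actual PDF the same idea applies to glyph widths and positioning offsets rather than to string lengths.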

You also have to clean up instructions around the actual text drawing instruction. Consider the example of the column to redact with "yes"/"no" information again, but this time for more clarity the "yes" is drawn in green and the "no" in red. If you only replace the text drawing instructions, someone with an extractor that also extracts attributes like the color will immediately know the redacted information.

In the case of tagged PDFs, the tag attributes have to be inspected, too. In particular, there is an attribute ActualText which contains the actual text represented by the tagged instructions (used in particular by screen readers). If you only remove the text drawing instructions but leave the tags with their attributes, anyone using a screen reader may not even realize that you tried to redact something, as the screen reader reads the complete, original text to them.
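As a starting point for such an inspection, the structure tree can be walked to list every ActualText entry. This is a rough sketch against the PDFBox 2.0 structure classes; `ActualTextScan` and `collectActualText` are illustrative names. It only looks at structure elements, so ActualText entries attached directly to marked-content sequences inside a content stream, and other attributes such as alternate descriptions, are not covered.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureNode;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

public class ActualTextScan {

    // Collects the ActualText values of all structure elements in the document.
    // Returns an empty list for untagged documents.
    static List<String> collectActualText(PDDocument document) {
        List<String> found = new ArrayList<>();
        PDStructureTreeRoot root = document.getDocumentCatalog().getStructureTreeRoot();
        if (root != null) {
            collect(root, found);
        }
        return found;
    }

    // Depth-first walk; kids that are not structure nodes (marked-content
    // references, object references, plain MCIDs) are skipped.
    private static void collect(PDStructureNode node, List<String> found) {
        if (node instanceof PDStructureElement) {
            String actualText = ((PDStructureElement) node).getActualText();
            if (actualText != null) {
                found.add(actualText);
            }
        }
        for (Object kid : node.getKids()) {
            if (kid instanceof PDStructureNode) {
                collect((PDStructureNode) kid, found);
            }
        }
    }
}
```

Any value reported here that overlaps a redacted area would have to be removed or rewritten as part of the redaction.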

For a proper redaction, therefore, you essentially have to interpret all the current instructions, determine the actual content they draw, and create a new set of instructions which draws the same content without unnecessary extra instructions which may give away something about the redacted content.

And here we only looked at redacting text; properly redacting vector and bitmap graphics on a PDF page poses a similar number of challenges.

...

Thus, the code required for actual redaction is beyond the scope of a Stack Overflow answer. Nonetheless, the items above may help someone implementing a redactor avoid the typical traps of overly naive redaction code.

answered Nov 10 '22 00:11

mkl

Tags: java, pdf, pdfbox
