
Is it possible to redact PDF areas with PDFBox by position?

Tags: java, pdf, pdfbox

The Context

Currently, I have a solution that loops through a PDF and draws black rectangles over it.

I already have a list of PDRectangle objects marking the exact areas I need to cover, which hides all the text I want hidden.

The Problems

Problem number 1: The text underneath the black rectangle can still be easily copied, searched, or extracted by other tools.

I solved this by flattening my PDF (converting each page into an image so that the document becomes a single layer and the black rectangle can no longer be bypassed). This is the same solution as described here: Disable pdf-text searching with pdfBox

This is not actual redaction; it's more of a workaround. Which leads me to

Problem number 2:

My final PDF becomes an image document, so I lose all the PDF properties, including searching and copying. It's also a much slower process. I want to keep all the PDF properties while making the redacted areas unreadable by any means.

What I want to accomplish

That being said, I'd like to know whether it is possible, and how, to do actual redaction with PDFBox: blacking out rectangular areas (I already have all the positions I need) while keeping the PDF properties and making sure the redacted areas cannot be read.

Note: I'm aware of the problems PDFBox had with the old ReplaceText function, but here I have the positions, so I can make sure I blank out precisely the areas I need.

I'm also open to suggestions for other free libraries.

Technical Specification:

PDFBox 2.0.21
Java 11.0.6+10, AdoptOpenJDK
macOS Catalina 10.15.4, 16 GB, x86_64

My Code

This is how I draw the black rectangle:

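import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

// pdDocument is a PDDocument field of the enclosing class.
private void draw(PDPage page, PDRectangle hitPdRectangle) throws IOException {
    // try-with-resources closes the stream even if drawing fails.
    // resetContext = true (last argument) keeps graphics state left over from the
    // existing page content (e.g. a transformation) from affecting the rectangle.
    try (PDPageContentStream content = new PDPageContentStream(pdDocument, page,
            PDPageContentStream.AppendMode.APPEND, false, true)) {
        content.setNonStrokingColor(0f); // black fill

        content.addRect(hitPdRectangle.getLowerLeftX(),
            hitPdRectangle.getLowerLeftY() - 0.5f,
            hitPdRectangle.getWidth(),
            hitPdRectangle.getHeight());
        content.fill();
    }
}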

This is how I convert it into an Image PDF:

private PDDocument createNewRedactedPdf() throws IOException {
    PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);

    PDDocument redactedDocument = new PDDocument();

    for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
        BufferedImage image = pdfRenderer.renderImageWithDPI(pageIndex, 200);

        String formatName = "jpg";
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(image, formatName, baos);

        byte[] bimg = baos.toByteArray();

        PDPage page = pdDocument.getPage(pageIndex);
        float pageWidth  = page.getMediaBox().getWidth();
        float pageHeight = page.getMediaBox().getHeight();

        PDPage pageDraw = new PDPage(new PDRectangle(pageWidth, pageHeight));
        redactedDocument.addPage(pageDraw);
        String imgSuffixName = pageIndex + "." + formatName;
        PDImageXObject img = PDImageXObject.createFromByteArray(redactedDocument, bimg,
            pdDocument.getDocument().getDocumentID() + imgSuffixName);

        try (PDPageContentStream contentStream
                 = new PDPageContentStream(redactedDocument, pageDraw, PDPageContentStream.AppendMode.OVERWRITE, false)) {

            contentStream.drawImage(img, 0, 0, pageWidth, pageHeight);
        }
    }

    return redactedDocument;
}

Any thoughts?

asked Nov 17 '20 12:11 by Thales Valias


1 Answer

What you want, a true redaction feature, can be implemented on top of PDFBox, but it requires a lot of additional code (similar to the pdfSweep add-on built on top of iText).

In particular, you have found out yourself that it does not suffice to draw black rectangles over the areas to redact: text extraction or copy and paste from a viewer usually ignores completely whether text is visible or covered by something.

Thus, your code has to find the actual instructions that draw the text to redact and remove them. But you cannot simply remove them without replacement; otherwise, other text on the same line may be shifted by your redaction.
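To illustrate the shifting problem, here is a minimal sketch in plain Java. It is a toy model, not PDFBox code: the `Op` class, the fixed `GLYPH_WIDTH` and all method names are assumptions made up for this example. It only mimics the fact that a text-showing instruction advances the text position, so whatever is drawn next on the line starts where the previous string ended.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one text line; NOT the PDFBox API. */
final class TextShiftDemo {

    /** Assumed fixed glyph width in text-space units (real fonts vary per glyph). */
    static final double GLYPH_WIDTH = 6.0;

    /** A simplified instruction: shows text (and advances) or only advances the cursor. */
    static final class Op {
        final String text;      // null for a pure horizontal move
        final double advance;   // how far the cursor moves afterwards

        private Op(String text, double advance) {
            this.text = text;
            this.advance = advance;
        }

        static Op show(String text) {
            return new Op(text, text.length() * GLYPH_WIDTH);
        }

        static Op move(double dx) {
            return new Op(null, dx);
        }
    }

    /** Returns the x position at which each shown string starts. */
    static Map<String, Double> layout(List<Op> ops) {
        Map<String, Double> startX = new LinkedHashMap<>();
        double x = 0;
        for (Op op : ops) {
            if (op.text != null) {
                startX.put(op.text, x);
            }
            x += op.advance;
        }
        return startX;
    }

    /** Deletes the instruction showing the secret; later text slides left. */
    static List<Op> removeOnly(List<Op> ops, String secret) {
        List<Op> out = new ArrayList<>();
        for (Op op : ops) {
            if (!secret.equals(op.text)) {
                out.add(op);
            }
        }
        return out;
    }

    /** Swaps the instruction for an equally wide move; positions stay, but the width is still visible. */
    static List<Op> replaceWithMove(List<Op> ops, String secret) {
        List<Op> out = new ArrayList<>();
        for (Op op : ops) {
            out.add(secret.equals(op.text) ? Op.move(op.advance) : op);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Op> line = List.of(Op.show("Name: "), Op.show("Alice"), Op.show(" Dept: QA"));
        System.out.println(layout(line));
        System.out.println(layout(removeOnly(line, "Alice")));
        System.out.println(layout(replaceWithMove(line, "Alice")));
    }
}
```

In this model, `removeOnly` moves the trailing text left by the width of the removed string, while `replaceWithMove` keeps it in place at the price of recording exactly how wide the removed text was.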

Nor can you simply replace them with the same number of spaces or with a move-right by the width of the removed text. Consider a table from which you want to redact a column containing only "yes" and "no" entries. If, after redaction, a text extractor returns three spaces where there was a "yes" and two spaces where there was a "no", anyone looking at those results knows what was in the redacted area.
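The "yes"/"no" case can be played through in a few lines of plain Java. This is only a sketch of the information leak, not redaction code: the class and method names are hypothetical, and `contentIndependent` stands for any replacement whose form does not depend on the removed text (an assumption of this example, not a recipe).

```java
import java.util.List;

/** Illustrates how length-preserving padding gives away a "yes"/"no" column. */
final class PaddingLeakDemo {

    /** Naive redaction: as many spaces as the removed text had characters. */
    static String padToOriginalLength(String secret) {
        return " ".repeat(secret.length());
    }

    /** Content-independent alternative (an assumption of this sketch): always emit nothing. */
    static String contentIndependent(String secret) {
        return "";
    }

    /** What a reader of the extracted text can infer when the column only holds "yes" or "no". */
    static String guessFromPadding(String extracted) {
        switch (extracted.length()) {
            case 3: return "yes";
            case 2: return "no";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        for (String cell : List.of("yes", "no", "yes")) {
            System.out.println(guessFromPadding(padToOriginalLength(cell))
                + " vs " + guessFromPadding(contentIndependent(cell)));
        }
    }
}
```

With the length-preserving padding, `guessFromPadding` recovers every cell; with a replacement that looks the same for every cell, there is nothing left to tell the entries apart.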

You also have to clean up the instructions around the actual text drawing instructions. Consider the "yes"/"no" column again, but this time, for more clarity, the "yes" is drawn in green and the "no" in red. If you only replace the text drawing instructions, someone with an extractor that also extracts attributes such as color will immediately know the redacted information.

In tagged PDFs, the tag attributes have to be inspected, too. In particular, there is an attribute ActualText which contains the actual text represented by the tagged instructions (used in particular by screen readers). If you only remove the text drawing instructions but leave the tags with their attributes, anyone using a screen reader may not even realize that you tried to redact something, as the screen reader reads out the complete, original text.

For a proper redaction, therefore, you essentially have to interpret all the current instructions, determine the actual content they draw, and create a new set of instructions that draws the same content minus the redacted parts, without unnecessary extra instructions that might give away something about the redacted content.

And here we only looked at redacting text; redacting vector and bitmap graphics on a PDF page poses a similar number of challenges for proper redaction.
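The color leak can be sketched with a toy content stream in plain Java. Everything here is a simplification made up for illustration: instructions are plain strings, only the `rg` fill color operator is handled, and `paints` knows just two painting operators. A real implementation would also have to track the graphics state stack (`q`/`Q`), the other color operators and much more.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy content stream: one instruction per string, operands before the operator. */
final class ColorLeakDemo {

    /** Naive redaction: drop only the instruction that shows the secret text. */
    static List<String> removeText(List<String> ops, String textOp) {
        List<String> out = new ArrayList<>(ops);
        out.remove(textOp);
        return out;
    }

    /** True for the instructions of this toy model that actually paint something. */
    static boolean paints(String op) {
        return op.endsWith(" Tj") || op.equals("f");
    }

    /** Drops every "rg" fill color that is replaced, or reaches the end, before anything is painted with it. */
    static List<String> dropUnusedFillColors(List<String> ops) {
        List<String> out = new ArrayList<>();
        int pending = -1; // index in out of a color instruction not yet used by any painting
        for (String op : ops) {
            if (op.endsWith(" rg")) {
                if (pending >= 0) {
                    out.remove(pending);
                }
                pending = out.size();
            } else if (paints(op)) {
                pending = -1;
            }
            out.add(op);
        }
        if (pending >= 0) {
            out.remove(pending);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cell = List.of(
            "0 0.6 0 rg", "BT", "(yes) Tj", "ET",   // green "yes", to be redacted
            "0 0 0 rg", "BT", "(Total) Tj", "ET");  // black text that stays
        List<String> naive = removeText(cell, "(yes) Tj");
        System.out.println(naive);                       // still contains the green fill color
        System.out.println(dropUnusedFillColors(naive)); // green is gone, "(Total) Tj" stays
    }
}
```

After the naive removal the stream still sets a green fill color that nothing uses, which is exactly the hint an attribute-aware extractor would pick up; the second pass removes it but leaves the color that the remaining text needs.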

...

Thus, the code required for actual redaction is beyond the scope of a Stack Overflow answer. Nonetheless, the items above may help someone implementing a redactor avoid the typical traps of overly naive redaction code.

answered Nov 10 '22 00:11 by mkl