How do you programmatically redact PDF files?

Adobe Acrobat has the ability to redact PDF files (that is, actually remove the information, rather than simply drawing a black box on top of it). I would like to use this feature programmatically. To redact using the GUI you select the Mark for Redaction Tool, draw it over the text to be redacted, then Apply Redactions.

Is there any way to do this programmatically, either through AppleScript or some other way?

I know the (x, y) location of the text to be redacted.

Thanks!

vy32 asked May 18 '11 12:05


2 Answers

In order to properly redact a PDF, you need to Alter The Content Stream. This is Very Hard.

If you can find the portion of the content stream that draws the text you want removed, you're halfway there.

The other half is figuring out how to change the content stream such that you don't modify the rest of the document. If the next text draw operator is preceded by a "Tm" operator (set the text matrix, which absolutely positions the next piece of text), it's easy. If not, you have to calculate the exact width of the text you're removing (several different PDF libraries can do this), and alter the drawing commands to skip over that much space.
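As a rough illustration of that width calculation, here is a minimal Python sketch. The width table is hypothetical (a real tool would read the font's /Widths array or the embedded font program), and it assumes a simple single-byte font where one character maps to one glyph.

```python
# Hypothetical glyph widths in 1/1000 text-space units; NOT read from a real font.
HYPOTHETICAL_WIDTHS = {
    "R": 722, "E": 667, "D": 722, "A": 667, "C": 722, "T": 611, " ": 278,
}


def text_width(text, widths, font_size, char_spacing=0.0, word_spacing=0.0, h_scale=1.0):
    """Return the horizontal advance of `text` in text-space units.

    Per glyph this follows the PDF text-positioning rule
    (w0 / 1000 * Tfs + Tc + Tw) * Th, where Tw applies only to the space character.
    """
    total = 0.0
    for ch in text:
        advance = widths[ch] / 1000.0 * font_size + char_spacing
        if ch == " ":
            advance += word_spacing
        total += advance * h_scale
    return total


# Width of " REDACT " at 10 points with the hypothetical table above.
redact_width = text_width(" REDACT ", HYPOTHETICAL_WIDTHS, 10)
```

With real font metrics the same loop applies; character spacing (Tc), word spacing (Tw) and horizontal scaling (Tz) have to be taken from the graphics state in effect at that point in the stream.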

For Example:

BT
/F1 10 Tf
1 0 0 1 30 720 Tm
(Here's some text, and you only want to REDACT that upper case "redact" over there)Tj
T*
(This text is positioned relative to the previous line)Tj
1 0 0 1 30 650 Tm
(This text is positioned absolutely, starting at 30, 650)Tj
ET

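So you'd have to break up that first (...)Tj line into (Here's some text, and you only want to)Tj, N 0 Td, and (that upper case "redact" over there)Tj, where N adjusts the position of the following text drawing operation so that it lands in EXACTLY THE SAME SPOT. For that you need to know the precise width of " REDACT " using the font resource /F1 (whatever that turned out to be), sized to 10 points. Bear in mind that Td is measured from the start of the current line, not from the end of the text just drawn, so N also has to include the width of the text you kept in front of the gap, and because Td moves the line start as well, any relatively positioned line that follows has to be shifted back.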
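An alternative to the N 0 Td move is to turn the Tj into a TJ array and put the gap inline as a negative adjustment, which leaves the line start alone. The sketch below is only that, a sketch: the width table is hypothetical, the function names are mine, and it assumes zero character and word spacing and 100% horizontal scaling, so the gap is simply the sum of the removed glyph widths in thousandths of a text-space unit.

```python
# Hypothetical glyph widths in 1/1000 text-space units; NOT read from a real font.
HYPOTHETICAL_WIDTHS = {
    "R": 722, "E": 667, "D": 722, "A": 667, "C": 722, "T": 611, " ": 278,
}


def escape_pdf_string(text):
    """Escape the characters that are special inside a (literal) PDF string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def redact_tj(text, target, widths):
    """Rewrite the operand of a Tj as a TJ array with `target` replaced by a gap.

    Numbers in a TJ array are subtracted from the current position, so a
    negative number moves the following text to the right.
    """
    start = text.find(target)
    if not target or start < 0:
        return "(%s) Tj" % escape_pdf_string(text)
    end = start + len(target)
    gap = sum(widths[ch] for ch in target)  # width of the removed glyphs
    return "[(%s) %d (%s)] TJ" % (
        escape_pdf_string(text[:start]), -gap, escape_pdf_string(text[end:]))


print(redact_tj("you only want to REDACT that upper case", " REDACT ", HYPOTHETICAL_WIDTHS))
# prints [(you only want to) -4667 (that upper case)] TJ
```

Because the adjustment is expressed in glyph-space thousandths, it does not depend on the font size, and a T* on the next line still starts from the original line origin.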

Just to make your life more exciting, you have to worry about kerned text too. You can provide little spacing adjustments inline with text thusly:

(This is taken from the first text drawn in the PDF Spec)

[(Adobe Sys)5(t)1(ems Inc)5(orporated)5( 20)5(08 \226 All rights)5( reser)-9(ved)]TJ

To properly redact "Incorporated", you need to determine that it's been split across two strings, and adjust the positioning of the string following it so it's in Exactly The Same Spot.
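To make the split-string case concrete, here is a hedged Python sketch that works on an already-parsed TJ array (a plain list of strings and numbers, which is my own representation, not any library's). It flattens the array to per-glyph tokens, finds the target across string boundaries, and replaces the whole span, including any kerning numbers inside it, with a single adjustment that preserves the net advance. The uniform 500-unit widths are hypothetical, and character spacing, word spacing and horizontal scaling are assumed to be at their defaults.

```python
def redact_tj_array(items, target, widths):
    """Return a new TJ array with `target` replaced by one equivalent adjustment."""
    # Flatten: one token per character, numbers kept as they are.
    tokens = []
    for item in items:
        if isinstance(item, str):
            tokens.extend(item)
        else:
            tokens.append(item)

    glyph_positions = [i for i, t in enumerate(tokens) if isinstance(t, str)]
    text = "".join(tokens[i] for i in glyph_positions)
    hit = text.find(target)
    if not target or hit < 0:
        return list(items)

    first = glyph_positions[hit]
    last = glyph_positions[hit + len(target) - 1]

    # Net advance of the removed span: glyph widths add, TJ numbers subtract.
    advance = 0
    for t in tokens[first:last + 1]:
        advance += widths[t] if isinstance(t, str) else -t
    tokens[first:last + 1] = [-advance]

    # Regroup characters into strings and merge neighbouring numbers.
    out = []
    for t in tokens:
        if isinstance(t, str):
            if out and isinstance(out[-1], str):
                out[-1] += t
            else:
                out.append(t)
        elif out and not isinstance(out[-1], str):
            out[-1] += t
        else:
            out.append(t)
    return out


# Hypothetical widths: every glyph in the target is 500 units wide.
HYPOTHETICAL_WIDTHS = {ch: 500 for ch in "Incorporated"}
print(redact_tj_array(["ems Inc", 5, "orporated", 5, " 20"], "Incorporated", HYPOTHETICAL_WIDTHS))
# prints ['ems ', -5990, ' 20']
```

The 12 removed glyphs advance 6000 units, the two kerning values of 5 pull back 10, so the single -5990 entry leaves " 20" where it was.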

And strings can be <DEADBEEF> hex values rather than (plain old ASCII).
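Decoding the hex form is the easy part. A minimal Python sketch (the function name is my own) that turns a hex string token into raw bytes, ignoring embedded whitespace and padding an odd final digit with 0:

```python
def decode_hex_string(token):
    """Decode a PDF hex string token such as <DEADBEEF> into raw bytes."""
    if not (token.startswith("<") and token.endswith(">")):
        raise ValueError("not a hex string: %r" % token)
    digits = "".join(token[1:-1].split())  # whitespace inside <...> is ignored
    if len(digits) % 2:
        digits += "0"  # a missing final digit is treated as 0
    return bytes.fromhex(digits)


print(decode_hex_string("<48 65 6C6C 6F>"))
# prints b'Hello'
```

The result is a sequence of character codes, not text: what those codes mean still depends on the font's encoding (or its ToUnicode map), so matching the words to redact has to happen after that mapping.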

Get the idea? And I haven't covered all the possibilities here, just the most common ones.

Like I said: This is Very Hard.


There's an Acrobat plug-in called Appligent Redax (no connection) that lets you draw annotations (or generate them via templates, regex, etc.) and then run their code to handle the redaction. It should be possible to programmatically create their annotations and perhaps even activate their plug-in: JavaScript in a document can run a menu item.

Mark Storer answered Oct 12 '22 19:10


You can use GroupDocs.Redaction for .NET to programmatically redact text in PDF documents. It supports exact-phrase, case-sensitive, and regular-expression redaction. This is how you can perform an exact-phrase redaction:

using (Document doc = Redactor.Load("D:\\candy.pdf"))
{
    // Replace every exact occurrence of "candy" with "[redacted]".
    doc.RedactWith(new ExactPhraseRedaction("candy", new ReplacementOptions("[redacted]")));
    // Save the document to "*_Redacted.*" file.
    doc.Save(new SaveOptions() { AddSuffix = true, RasterizeToPDF = false });
}

Disclosure: I work as Developer Evangelist at GroupDocs.

Usman Aziz answered Oct 12 '22 20:10