iText or iTextSharp rudimentary text edit

I can extract text from pages in a PDF in many ways:

String pageText = PdfTextExtractor.GetTextFromPage(reader, i);

This can be used to get any text on a page.

Alternatively:

byte[] contentBytes = iTextSharp.text.pdf.parser.ContentByteUtils.GetContentBytesForPage(reader, i);

Possibilities are endless.

Now I want to remove/redact a certain word (e.g. explicit words or sensitive information) from the PDF, which is simple and text-only. Putting black boxes over the words is obviously a bad idea :) I can find the word just fine using the approach above, count its occurrences, etc.

I do not care about layout, or the fact that PDF is not really meant to be manipulated in this way.

I just wish to know if there is a mechanism that would allow me to manipulate the raw content of my PDF in this way. You could say I'm looking for "SetContentBytesForPage()" ...

asked Feb 07 '14 00:02

Kris

2 Answers

If you want to change the content of a page, it isn't sufficient to change the content stream of a page. A page may contain references to Form XObjects that contain content that you want to remove.

A secondary problem consists of images. For instance: suppose that your document consists of a scanned document that has been OCR'ed. In that case, it isn't sufficient to remove the (vector) text, you'll also need to manipulate the (pixel) text in the image.

Assuming that your secondary problem doesn't exist, you'll need a double approach:

1. Get the content from the page as text, to detect which pages contain the names or words you want to remove.
2. Recursively loop over all the content streams to find that text, and rewrite those content streams without it.

From your question, I assume that you have already solved problem 1. Solving problem 2 isn't that trivial. In chapter 15 of my book, I have an example where extracting text returns "Hello World", but when you look inside the content stream, you see:

BT
/F1 12 Tf
88.66 367 Td
(ld) Tj
-22 0 Td
(Wor) Tj
-15.33 0 Td
(llo) Tj
-15.33 0 Td
(He) Tj
ET

Before you can remove "Hello World" from this stream snippet, you'll need some heuristics so that your program recognizes the text in this syntax.
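As a rough illustration of such a heuristic, here is a minimal sketch in plain Java (no iText involved). It assumes a single line of text, only Td and Tj operators, literal strings without escapes, and standard Helvetica glyph widths for /F1 (the snippet doesn't say which font /F1 is, so that is a guess). The class and method names are made up for this example. It orders the fragments by x position and inserts a space where the gap between two fragments is wider than half a space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextRunSketch {
    /** One string shown by a Tj operator, with the x position it starts at. */
    static final class Run {
        final double x;
        final String text;
        Run(double x, String text) { this.x = x; this.text = text; }
    }

    // ASSUMPTION: glyph widths in 1/1000 text units, taken from standard Helvetica.
    static final Map<Character, Integer> WIDTHS = Map.of(
        'H', 722, 'e', 556, 'l', 222, 'o', 556, 'W', 944, 'r', 333, 'd', 556, ' ', 278);

    // Matches "tx ty Td" or "(string) Tj". Escapes and nested parentheses are not handled.
    static final Pattern OP = Pattern.compile(
        "(-?[\\d.]+)\\s+(-?[\\d.]+)\\s+Td|\\(([^()\\\\]*)\\)\\s*Tj");

    /** Collects the shown strings together with their start positions. */
    static List<Run> parseRuns(String content) {
        List<Run> runs = new ArrayList<>();
        double x = 0; // Td offsets are relative to the start of the current line
        Matcher m = OP.matcher(content);
        while (m.find()) {
            if (m.group(3) != null) {
                runs.add(new Run(x, m.group(3)));
            } else {
                x += Double.parseDouble(m.group(1));
            }
        }
        return runs;
    }

    /** Width of a string at the given font size, using the assumed width table. */
    static double width(String s, double fontSize) {
        double w = 0;
        for (char c : s.toCharArray()) {
            w += WIDTHS.getOrDefault(c, 500) * fontSize / 1000.0;
        }
        return w;
    }

    /** Rebuilds the text in reading order, adding a space where fragments are far apart. */
    static String rebuild(String content, double fontSize) {
        List<Run> runs = parseRuns(content);
        runs.sort(Comparator.comparingDouble(r -> r.x));
        double halfSpace = WIDTHS.get(' ') * fontSize / 1000.0 / 2;
        StringBuilder sb = new StringBuilder();
        double end = Double.NaN;
        for (Run r : runs) {
            if (!Double.isNaN(end) && r.x - end > halfSpace) {
                sb.append(' ');
            }
            sb.append(r.text);
            end = r.x + width(r.text, fontSize);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String stream = "BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n"
            + "-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET";
        System.out.println(rebuild(stream, 12)); // prints "Hello World"
    }
}
```

Real content streams also use TJ arrays, hex strings, text matrices and custom font encodings, none of which this sketch handles.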

Once you've found the text, you need to rewrite the stream. For inspiration, you can take a look at the OCG remover functionality in the itext-xtra package.

Long story short: if your PDFs are relatively simple, that is, if the text can be easily detected in the different content streams (page content and Form XObject content), then it's simply a matter of rewriting those streams after some string manipulations.

I've made you a simple example named ReplaceStream that replaces "Hello World" with "HELLO WORLD" in a PDF.

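// Imports assume iText 5 package names (com.itextpdf.text).
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.*;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.getPageN(1);
    PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
    // Only handles a page whose /Contents entry is a single stream.
    if (object instanceof PRStream) {
        PRStream stream = (PRStream)object;
        byte[] data = PdfReader.getStreamBytes(stream);
        // ISO-8859-1 maps bytes to chars one-to-one, so non-ASCII bytes survive the round trip.
        String content = new String(data, StandardCharsets.ISO_8859_1);
        stream.setData(content.replace("Hello World", "HELLO WORLD").getBytes(StandardCharsets.ISO_8859_1));
    }
    // The stamper writes the modified document to dest when it is closed.
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();
}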
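One detail that is easy to overlook: this kind of code decodes the stream bytes into a string and encodes the result back into bytes, so the charset matters. A minimal sketch in plain Java (the class and method names are made up for this example) that checks which charsets return every byte value unchanged:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    /** Decodes the bytes with the given charset and encodes them again. */
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    /** True when all 256 byte values survive decode + encode unchanged. */
    static boolean isLossless(Charset cs) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return Arrays.equals(all, roundTrip(all, cs));
    }

    public static void main(String[] args) {
        System.out.println(isLossless(StandardCharsets.ISO_8859_1)); // true
        System.out.println(isLossless(StandardCharsets.US_ASCII));   // false
        System.out.println(isLossless(StandardCharsets.UTF_8));      // false
    }
}
```

If a content stream contains bytes above 127 (inline image data, or strings in a non-ASCII encoding), a lossy charset would silently corrupt them, even in parts of the stream the replacement never touches.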

Some caveats:

I check if object is a stream. It could also be an array of streams. In that case, you need to loop over that array.
I don't check if there are form XObjects defined for the page.
I assume that Hello World can be easily detected in the PDF Syntax.
...
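For the removal case from the question, the same string manipulation can be restricted to the string operands, so that operators and numbers are left alone. A minimal sketch in plain Java, assuming the word sits inside a single literal string shown with Tj, in an encoding where the bytes read as plain text. The class and method names are hypothetical, and the resulting bytes would still have to be passed to setData as in the example above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RedactSketch {
    // A literal string operand followed by Tj. Escapes and nested parentheses are not handled.
    static final Pattern SHOW = Pattern.compile("\\(([^()\\\\]*)\\)(\\s*Tj)");

    /** Removes every occurrence of word from the strings shown with Tj. */
    static String removeWord(String content, String word) {
        Matcher m = SHOW.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String cleaned = m.group(1).replace(word, "");
            // quoteReplacement keeps '$' and '\' in the text from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement("(" + cleaned + ")" + m.group(2)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Byte-level wrapper; ISO-8859-1 keeps all other bytes of the stream intact. */
    static byte[] removeWord(byte[] streamBytes, String word) {
        String content = new String(streamBytes, StandardCharsets.ISO_8859_1);
        return removeWord(content, word).getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

Text that follows the removed word is not repositioned, and a word split over several Tj operators (as in the "Hello World" snippet above) is not found at all.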

In real life, PDFs are never that simple and the complexity of your project will increase dramatically with every special feature that is used in your documents.

answered Nov 09 '22 23:11

Bruno Lowagie

The C# equivalent of the code by Bruno:

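// Requires: using System; using System.IO; using iTextSharp.text.pdf;
static void manipulatePdf(String src, String dest)
{
    PdfReader reader = new PdfReader(src);
    PdfDictionary dict = reader.GetPageN(1);
    PdfObject pdfObject = dict.GetDirectObject(PdfName.CONTENTS);
    // Only a single content stream is handled; an array of streams is skipped.
    if (pdfObject != null && pdfObject.IsStream()) {
        PRStream stream = (PRStream)pdfObject;
        byte[] data = PdfReader.GetStreamBytes(stream);
        // ISO-8859-1 maps bytes to chars one-to-one; ASCII would turn bytes above 127 into '?'.
        System.Text.Encoding latin1 = System.Text.Encoding.GetEncoding("ISO-8859-1");
        stream.SetData(latin1.GetBytes(latin1.GetString(data).Replace("Hello World", "HELLO WORLD")));
    }
    FileStream outStream = new FileStream(dest, FileMode.Create);
    PdfStamper stamper = new PdfStamper(reader, outStream);
    // Without closing the stamper, the modified document is never written out completely.
    stamper.Close();
    reader.Close();
}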

I'll update this if it turns out to still contain errors.

answered Nov 09 '22 22:11

Kris

Tags: c#, pdf, itext
