Remove all text from PDF file

I am using Ghostscript to convert a source PDF file into an array of PNG images. Before I convert a PDF page into a PNG image, I need to remove all text from the PDF so that the converted page image contains all other elements but no text.

Can I achieve this with Ghostscript or will I need to look into different tools?

I would also be interested in a tool that can read my source PDF and save it again with all the text removed.

asked Jun 20 '14 07:06

Primoz Rome

2 Answers

You can achieve what you want without Ghostscript, simply by using a text editor.

Convert your compressed PDF into one which has (nearly) all PDF objects' contents and streams expanded into a readable form using QPDF:

```
 qpdf --qdf --object-streams=disable input.pdf editable.pdf
```
Open your new editable.pdf file with a text editor (one that gracefully handles any remaining binary blobs inside the PDF, such as font or ICC resources).
Search for all occurrences of the TJ and Tj strings (the PDF operators used to show text) inside the streams of the PDF objects and replace them with the JT and jT strings respectively (undefined, nonsense PDF operators). Because each replacement has the same length as the original, the stream lengths and file offsets stay valid. Save the file as edited.pdf.
Now convert your edited.pdf to your PNG images as needed.
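
If you would rather script the search-and-replace step than do it by hand, the following is a minimal Python sketch of the same idea. It is not part of the original workflow. The function names and the regular expressions are my own assumptions: the pattern only treats TJ/Tj as an operator when it directly follows the end of a string or array operand (or whitespace) and is followed by whitespace, which is how content streams usually look in QDF output.

```python
import re
import sys

# A Tj/TJ operator follows its operand: a string ending in ")" or ">",
# or an array ending in "]", usually separated by whitespace.
# Assumption: this heuristic is good enough for QDF-normalized files.
SHOW_TEXT = re.compile(rb"(?<=[)\]>\s])T[Jj](?=\s)")
DISABLED = re.compile(rb"(?<=[)\]>\s])[Jj]T(?=\s)")


def disable_text_operators(data: bytes) -> bytes:
    """Turn TJ into JT and Tj into jT; the byte length does not change."""
    return SHOW_TEXT.sub(lambda m: m.group(0)[::-1], data)


def restore_text_operators(data: bytes) -> bytes:
    """Reverse the edit by turning JT back into TJ and jT back into Tj."""
    return DISABLED.sub(lambda m: m.group(0)[::-1], data)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python strip_text.py editable.pdf edited.pdf
    with open(sys.argv[1], "rb") as src:
        original = src.read()
    with open(sys.argv[2], "wb") as dst:
        dst.write(disable_text_operators(original))
```

The file is handled as raw bytes so that binary blobs are written back unchanged, but a blob that happens to contain a matching byte sequence would still be altered, so check the result in a viewer. The less common `'` and `"` operators also show text and are not touched by this sketch.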

Note that edited.pdf will still display in most PDF viewers, but the text will be missing as intended. However, the text is easy to restore: putting back the original TJ/Tj operators reverses the manual modification.

In the "normalized" form created by the qpdf command given above, objects with streams usually look like this (where NNN is an integer number):

NNN 0 obj
<<
   % Here are the key:value pairs of the object dictionary
   /Key1 somevalue1
   /Key2 somevalue2
   % ... (more key:value pairs)
>>
stream
% Here is the content of the object stream
endstream
endobj

An "image stream" has basically the same structure. But the key:value pairs typically contain the following four entries, in any order (where NNN and MMM are integer values giving width and height of the image in pixels):

/Type /XObject
/Subtype /Image
/Width NNN
/Height MMM
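
To locate such image objects without scrolling through the file, a rough Python sketch along the following lines may help. It assumes the QDF layout shown above (dictionary, then `stream`), and the function name and regular expressions are hypothetical choices of mine, not anything provided by qpdf.

```python
import re

# One indirect object that carries a stream, as laid out in QDF output.
# The "(?!endobj)" guard keeps a match from running across an object boundary.
STREAM_OBJ = re.compile(
    rb"(\d+) (\d+) obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n",
    re.DOTALL,
)


def find_image_objects(data: bytes):
    """Return (object number, width, height) for each image stream found."""
    images = []
    for match in STREAM_OBJ.finditer(data):
        dictionary = match.group(3)
        if not re.search(rb"/Subtype\s*/Image\b", dictionary):
            continue
        width = re.search(rb"/Width\s+(\d+)", dictionary)
        height = re.search(rb"/Height\s+(\d+)", dictionary)
        # Assumption: width and height are written as plain integers.
        if width and height:
            images.append(
                (int(match.group(1)), int(width.group(1)), int(height.group(1)))
            )
    return images
```

This is plain pattern matching rather than a real PDF parser, so treat the result as a hint about where to look in the editor.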

Update/Correction

My bad! My original answer contained a repeated typo. I had used tj at places where Tj should have been used. Sorry for any confusion that may have created.

answered Oct 22 '22 12:10

Kurt Pfeifle

Obviously this is not a standard requirement, but it was recently discussed on the #ghostscript IRC channel. The channel is logged, and you can find the discussion here:

http://ghostscript.com/irclogs/2014/05/21.html

We originally suggested changing the initial text rendering mode to 3 (invisible text) in pdf_ops.ps, but that had no effect on the file in question because it was using a Type 3 font, and the text rendering mode does not apply to Type 3 fonts. So we suggested instead altering the definitions of TJ and Tj in the same file. Look at around 15:37 in the log.
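
As an aside that was not part of the IRC discussion: more recent Ghostscript releases document a -dFILTERTEXT switch that drops text while rendering, which may avoid patching pdf_ops.ps altogether. Whether your installed version supports it is something to check yourself. The Python sketch below only builds the command line; the function name, the output file pattern and the default resolution are assumptions for illustration.

```python
import subprocess


def build_gs_png_command(pdf_path, out_pattern="page-%03d.png", dpi=150,
                         filter_text=True):
    """Build a Ghostscript command that renders each page to a PNG file."""
    cmd = ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
           "-sDEVICE=png16m", f"-r{dpi}"]
    if filter_text:
        # Assumption: the installed Ghostscript is new enough to know this switch.
        cmd.append("-dFILTERTEXT")
    cmd += [f"-sOutputFile={out_pattern}", pdf_path]
    return cmd


def render_pages(pdf_path, **options):
    """Run Ghostscript; raises CalledProcessError if gs reports a failure."""
    subprocess.run(build_gs_png_command(pdf_path, **options), check=True)
```

Calling `render_pages("input.pdf")` would then write page-001.png, page-002.png and so on into the current directory.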

answered Oct 22 '22 10:10

KenS

Tags:

pdf-generation

ghostscript