
Converting searchable PDF to a non-searchable PDF

I have a PDF which is searchable and I need to convert it into a non-searchable one.

I tried using Ghostscript to convert it to JPEG and then back to PDF. That does the trick, but the resulting file is far too large to be acceptable.

I also tried using Ghostscript to convert the PDF to PS first and then back to PDF. That does the trick as well, but the quality is not good enough:

gswin32.exe -q -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pswrite -r1000 -sOutputFile=out.ps in.pdf
gswin32.exe -q -dNOPAUSE -dBATCH -dSAFER -dDEVICEWIDTHPOINTS=596 -dDEVICEHEIGHTPOINTS=834 -dPDFSETTINGS=/ebook -sDEVICE=pdfwrite -sOutputFile=out.pdf out.ps

Is there a way to improve the quality of the resulting PDF?

Alternatively, is there an easier way to convert a searchable PDF to a non-searchable one?

Steven Yong asked Feb 02 '12 03:02



2 Answers

You can use Ghostscript to achieve that. You need two steps:

  1. Convert the PDF to a PostScript file in which all used fonts are converted to outline shapes. The key here is the -dNOCACHE parameter:

    gs -o somepdf.ps -dNOCACHE -sDEVICE=pswrite somepdf.pdf

  2. Convert the PS back to PDF (and, optionally, delete the intermediate PS file):

    gs -o somepdf-with-outlines.pdf -sDEVICE=pdfwrite somepdf.ps
    rm somepdf.ps

Note that the resulting PDF will very likely be larger than the original. Also, unless you add command line parameters to control it, all images in the original PDF will likely be converted according to Ghostscript's built-in defaults. Still, the quality should be better than what your own Ghostscript attempt produced.
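
The two steps can be wrapped in a small shell function. This is only a minimal sketch: the function name `outline_pdf` and the `GS` override variable are illustrative names, not part of Ghostscript, and it assumes your Ghostscript build provides the `pswrite` device used above.

```shell
#!/bin/sh
# Sketch: wrap the two Ghostscript steps in one function.
# GS is a hypothetical override so that a differently named binary
# (for example gswin32c on Windows) can be substituted.
GS=${GS:-gs}

outline_pdf() {
  in=$1
  out=$2
  tmp_ps="${out%.pdf}.tmp.ps"

  # Step 1: PDF -> PostScript, with -dNOCACHE so glyphs become outlines.
  "$GS" -o "$tmp_ps" -dNOCACHE -sDEVICE=pswrite "$in" || {
    rm -f "$tmp_ps"
    return 1
  }

  # Step 2: PostScript -> PDF.
  "$GS" -o "$out" -sDEVICE=pdfwrite "$tmp_ps"
  status=$?

  # Remove the intermediate file whether or not step 2 succeeded.
  rm -f "$tmp_ps"
  return "$status"
}

# Usage: outline_pdf somepdf.pdf somepdf-with-outlines.pdf
```

Keeping the intermediate file name derived from the output name avoids overwriting an unrelated `.ps` file that happens to sit next to the input.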


Update

Apparently, from version 9.15 (to be released during September/October 2014), Ghostscript will support a new command line parameter:

 -dNoOutputFonts

which will cause the output devices pdfwrite, ps2write and eps2write "to 'flatten' glyphs into 'basic' marking operations (rather than writing fonts to the output)".

This means that the two steps above can be avoided and the desired result achieved with a single command:

 gs -o somepdf-with-outlines.pdf -dNoOutputFonts -sDEVICE=pdfwrite somepdf.pdf

Caveat: I've only tested this with a few input files, using a self-compiled Ghostscript based on current Git sources. It worked flawlessly in each case.
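
One way to check the result yourself is to confirm that no text can be extracted from the output and that it lists no fonts. The sketch below assumes the `pdftotext` and `pdffonts` tools from poppler-utils are installed; the function names are illustrative, and the font check relies on `pdffonts` printing two header lines before the font rows.

```shell
#!/bin/sh
# Sketch: check whether a PDF still contains extractable text or fonts.
# Assumes pdftotext and pdffonts (poppler-utils) are on the PATH.

# Succeeds if the text read from stdin contains any letter or digit.
has_text() {
  grep -q '[[:alnum:]]'
}

# Succeeds if pdffonts output read from stdin lists at least one font.
# The first two lines are skipped because they are the table header.
has_fonts() {
  [ "$(tail -n +3 | grep -c .)" -gt 0 ]
}

# Succeeds if the given PDF has extractable text or lists any font.
is_searchable() {
  pdftotext "$1" - | has_text && return 0
  pdffonts "$1" | has_fonts
}

# Usage:
#   if is_searchable somepdf-with-outlines.pdf; then
#     echo "still searchable"
#   fi
```

The font check is deliberately strict: a file can list fonts without containing extractable text, so treat a positive result as a reason to inspect the file rather than as proof.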

Kurt Pfeifle answered Oct 08 '22 07:10


A possible way to produce a non-searchable vector PDF from a searchable vector PDF is:

  1. Burst the PDF into single pages:

    pdftk file.pdf burst

  2. Convert each single page to SVG with pdftocairo, which is part of the poppler utils (http://poppler.freedesktop.org/):

    for f in *.pdf; do pdftocairo -svg "$f"; done

  3. Delete ALL PDF files in the folder.

  4. Convert ALL SVG files back to PDF with the Batik rasterizer (http://xmlgraphics.apache.org/batik/tools/rasterizer.html). This time the resulting PDFs remain vector-based but are no longer searchable:

    java -jar ./batik-rasterizer.jar -m application/pdf *.svg

  5. Finally, join all the resulting single-page PDFs into one multi-page PDF file:

    pdftk *.pdf cat output out.pdf
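
The steps above can be sketched as a single function that works in a scratch directory, so the "delete all PDFs" step never touches the original file. This is an untested outline, not a definitive script: the function name `strip_text_via_svg` and the `BATIK_JAR` variable are hypothetical, and it assumes that `pdftk`, `pdftocairo` and a Batik rasterizer with PDF output support are installed.

```shell
#!/bin/sh
# Sketch of the burst -> SVG -> PDF -> join pipeline in a scratch directory.
# BATIK_JAR is a hypothetical variable pointing at batik-rasterizer.jar.
BATIK_JAR=${BATIK_JAR:-./batik-rasterizer.jar}

# If any step fails, the scratch directory is left in place for inspection.
strip_text_via_svg() {
  in=$1
  out=$2
  work=$(mktemp -d) || return 1

  # 1. Split the input into single pages inside the scratch directory.
  pdftk "$in" burst output "$work/pg_%04d.pdf" || return 1

  # 2 and 3. Convert each page to SVG, then delete the page PDF.
  for f in "$work"/pg_*.pdf; do
    pdftocairo -svg "$f" "${f%.pdf}.svg" || return 1
    rm -f "$f"
  done

  # 4. Convert the SVG files back to PDF; the output is written
  #    next to each SVG file.
  java -jar "$BATIK_JAR" -m application/pdf "$work"/pg_*.svg || return 1

  # 5. Join the single-page PDFs; the zero-padded names keep page order.
  pdftk "$work"/pg_*.pdf cat output "$out" || return 1

  rm -rf "$work"
}

# Usage: strip_text_via_svg file.pdf out.pdf
```

The zero-padded page names matter because the shell expands `pg_*.pdf` in lexical order, which matches page order only when every number has the same width.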

Dingo answered Oct 08 '22 06:10