
Optimize PDF files (with Ghostscript or other)

Is Ghostscript the best option if you want to optimize a PDF file and reduce the file size?

I need to store a lot of PDF files, so I need to reduce their file size as much as possible.

Does anyone have experience with Ghostscript and/or other tools?

command line

exec('gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -sOutputFile='.escapeshellarg($file_new).' '.escapeshellarg($file));
clarkk asked May 04 '12 13:05


2 Answers

If you are looking for Free (as in 'libre') software, Ghostscript is surely your best choice. However, it is not always easy to use: some of its (very powerful) processing options are poorly documented.

Have a look at this answer, which explains how to exercise finer control over image downsampling than the generic -dPDFSETTINGS=/screen allows (that switch only sets a few overall defaults, which you may want to override):

  • How to downsample images within pdf file?

Basically, it tells you how to make Ghostscript downsample all images to a resolution of 72 dpi (the value that -dPDFSETTINGS=/screen uses; you may want to go even lower):

-dDownsampleColorImages=true \
-dDownsampleGrayImages=true \
-dDownsampleMonoImages=true \
-dColorImageResolution=72 \
-dGrayImageResolution=72 \
-dMonoImageResolution=72 \
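For context, here is a minimal sketch of how those switches might fit into a complete command. The function name downsample_pdf and the optional third argument are my own inventions for illustration; only the gs switches come from the text above (-o is Ghostscript's shorthand that also implies -dBATCH -dNOPAUSE).

```shell
# Sketch: downsample all images in a PDF to a target resolution.
# downsample_pdf is a hypothetical helper name, not part of Ghostscript.
# Arguments: <input.pdf> <output.pdf> [dpi, defaults to 72]
downsample_pdf() {
    src=$1
    dst=$2
    dpi=${3:-72}
    gs -o "$dst" \
       -sDEVICE=pdfwrite \
       -dDownsampleColorImages=true \
       -dDownsampleGrayImages=true \
       -dDownsampleMonoImages=true \
       -dColorImageResolution="$dpi" \
       -dGrayImageResolution="$dpi" \
       -dMonoImageResolution="$dpi" \
       "$src"
}

# Usage: downsample_pdf input.pdf output.pdf 72
```

Quoting "$src" and "$dst" keeps file names with spaces intact, which matters if you run this over a large collection of files.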

To see whether Ghostscript can also 'un-embed' the fonts used (sometimes it works, sometimes not, depending on the complexity of the embedded font and on the font type), try adding the following to your gs command:

gs \
  -o output.pdf \
  [...other options...] \
  -dEmbedAllFonts=false \
  -dSubsetFonts=true \
  -dConvertCMYKImagesToRGB=true \
  -dCompressFonts=true \
  -c ".setpdfwrite <</AlwaysEmbed [ ]>> setdistillerparams" \
  -c ".setpdfwrite <</NeverEmbed [/Courier /Courier-Bold /Courier-Oblique /Courier-BoldOblique /Helvetica /Helvetica-Bold /Helvetica-Oblique /Helvetica-BoldOblique /Times-Roman /Times-Bold /Times-Italic /Times-BoldItalic /Symbol /ZapfDingbats /Arial]>> setdistillerparams" \
  -f input.pdf

Note: Downsampling images reduces quality irreversibly, and un-embedding fonts will make it difficult or impossible to display and print the PDFs correctly unless the same fonts are installed on the machine.
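If you do experiment with un-embedding, it may help to check which fonts ended up without embedded data. The sketch below filters the output of Poppler's pdffonts tool; the helper name list_unembedded_fonts is hypothetical, and it assumes the usual pdffonts table layout, in which the last five fields of each row are emb, sub, uni, object number and generation.

```shell
# Sketch: print the names of fonts that are NOT embedded.
# Reads `pdffonts` output on stdin; the first two lines are the table header.
# Assumption: the fifth field from the end of each row is the "emb" column.
list_unembedded_fonts() {
    awk 'NR > 2 && $(NF-4) == "no" { print $1 }'
}

# Usage: pdffonts output.pdf | list_unembedded_fonts
```

Any font listed here has to be available on the viewing or printing machine, otherwise a substitute will be used.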


Update

One option which I had overlooked in my original answer is to add

-dDetectDuplicateImages=true 

to the command line. This parameter makes Ghostscript try to detect any images which are embedded in the PDF multiple times. This can happen if you use an image as a logo or page background, and if the PDF-generating software is not optimized for this situation. This used to be the case with older versions of OpenOffice/LibreOffice (I tested the latest release of LibreOffice, v4.3.5.2, and it no longer does this).

It also happens if you concatenate PDF files with the help of pdftk. To show you the effect, and how you can discover it, let's look at a sample PDF file:

pdfinfo p1.pdf
  Producer:       libtiff / tiff2pdf - 20120922
  CreationDate:   Tue Jan  6 19:36:34 2015
  ModDate:        Tue Jan  6 19:36:34 2015
  Tagged:         no
  UserProperties: no
  Suspects:       no
  Form:           none
  JavaScript:     no
  Pages:          1
  Encrypted:      no
  Page size:      595 x 842 pts (A4)
  Page rot:       0
  File size:      20983 bytes
  Optimized:      no
  PDF version:    1.1

Recent versions of Poppler's pdfimages utility have added support for a -list parameter, which can list all images included in a PDF file:

pdfimages -list p1.pdf
  page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------
     1   0 image    423   600   rgb    3   8 jpeg     no     7  0    52    52 19.2K 2.6%

This sample PDF is a 1-page document, containing an image, which is compressed with JPEG-compression, has a width of 423 pixels and a height of 600 pixels and renders at a resolution of 52 PPI on the page.

If we concatenate 3 copies of this file with the help of pdftk like so:

pdftk p1.pdf p1.pdf p1.pdf cat output p3.pdf 

then the result shows these image properties via pdfimages -list:

pdfimages -list p3.pdf
  page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------
     1   0 image   423    600   rgb    3   8 jpeg     no     4  0    52    52 19.2K 2.6%
     2   1 image   423    600   rgb    3   8 jpeg     no     8  0    52    52 19.2K 2.6%
     3   2 image   423    600   rgb    3   8 jpeg     no    12  0    52    52 19.2K 2.6%

This shows that there are 3 identical PDF objects (with the IDs 4, 8 and 12) which are embedded in p3.pdf now. p3.pdf consists of 3 pages:

pdfinfo p3.pdf | grep Pages:
  Pages:          3

Optimize PDF by replacing duplicate images with references

Now we can apply the above-mentioned optimization with the help of Ghostscript:

 gs -o p3-optim.pdf -sDEVICE=pdfwrite -dDetectDuplicateImages=true p3.pdf 

Checking:

pdfimages -list p3-optim.pdf
  page num  type width height color comp bpc  enc interp objectID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------
     1   0 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
     2   1 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%
     3   2 image   423    600   rgb    3   8 jpeg     no    10  0    52    52 19.2K 2.6%

There is still one image listed per page -- but the PDF object ID is always the same now: 10.
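For PDFs with many pages, comparing object IDs by eye gets tedious. As a rough sketch (the helper name count_image_objects is made up, and it assumes the column layout shown in the listings above, where the 11th whitespace-separated field of each row is the object number), you could let awk count image rows and distinct objects:

```shell
# Sketch: summarize `pdfimages -list` output read from stdin.
# Prints "<image rows> <distinct object numbers>".
# Assumption: field 3 is the type and field 11 is the PDF object number.
count_image_objects() {
    awk 'NR > 2 && $3 == "image" {
             rows++
             if (!($11 in seen)) { seen[$11] = 1; uniq++ }
         }
         END { printf "%d %d\n", rows, uniq }'
}

# Usage: pdfimages -list p3-optim.pdf | count_image_objects
```

If the second number is much smaller than the first, images are being shared between pages. Note that a high object count before optimization does not prove duplication by itself; it only tells you how many separate image objects the file carries.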

ls -ltrh p1.pdf p3.pdf p3-optim.pdf
  -rw-r--r--@ 1 kp  staff    20K Jan  6 19:36 p1.pdf
  -rw-r--r--  1 kp  staff    60K Jan  6 19:37 p3.pdf
  -rw-r--r--  1 kp  staff    16K Jan  6 19:40 p3-optim.pdf

As you can see, the "dumb" concatenation made with pdftk tripled the file size (20K to 60K). The optimization by Ghostscript brought it down to 16K, which is even smaller than the original single-page file.
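If you want that comparison as a number rather than reading it off ls, a small helper along these lines might do. The name size_reduction_percent is hypothetical, and the sketch assumes the original file is not empty:

```shell
# Sketch: print by how many percent the second file is smaller than the first
# (integer arithmetic, rounded down).
# Arguments: <original.pdf> <optimized.pdf>
size_reduction_percent() {
    orig_size=$(wc -c < "$1")
    new_size=$(wc -c < "$2")
    echo $(( ($orig_size - $new_size) * 100 / $orig_size ))
}

# Usage: size_reduction_percent p3.pdf p3-optim.pdf
```

For the listing above (60K down to 16K) this would come out at roughly 73. A negative result means the "optimized" file actually grew, in which case you probably want to keep the original.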

The most recent versions of Ghostscript may even apply the -dDetectDuplicateImages by default. (AFAIR, v9.02, which introduced it for the first time, didn't use it by default.)

Kurt Pfeifle answered Sep 28 '22 07:09


You can obtain good results by converting from PDF to Postscript, then back to PDF using

pdf2ps file.pdf file.ps
ps2pdf -dPDFSETTINGS=/ebook file.ps file-optimized.pdf

The value of the -dPDFSETTINGS argument determines the quality of the images in the resulting PDF. The options are, from low to high quality: /screen, /default, /ebook, /printer, /prepress; see http://milan.kupcevic.net/ghostscript-ps-pdf/ for a reference.

The Postscript file can become quite large, but the results are worth it. I went from a 60 MB PDF to a 140 MB Postscript file, but ended up with a 1.1 MB optimized PDF.
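Since the intermediate Postscript file can be that large, you may not want it lying around afterwards. Here is a sketch of the same two steps wrapped in a function that writes the Postscript to a temporary file and always removes it; the name roundtrip_pdf, the temp-file template and the optional preset argument are my own assumptions.

```shell
# Sketch: PDF -> Postscript -> PDF round trip with cleanup of the
# intermediate file. roundtrip_pdf is a hypothetical helper name.
# Arguments: <input.pdf> <output.pdf> [preset, defaults to /ebook]
roundtrip_pdf() {
    src=$1
    dst=$2
    preset=${3:-/ebook}
    tmp_ps=$(mktemp "${TMPDIR:-/tmp}/roundtrip.XXXXXX") || return 1
    # Only run ps2pdf if pdf2ps succeeded.
    pdf2ps "$src" "$tmp_ps" &&
        ps2pdf -dPDFSETTINGS="$preset" "$tmp_ps" "$dst"
    rc=$?
    rm -f "$tmp_ps"
    return "$rc"
}

# Usage: roundtrip_pdf file.pdf file-optimized.pdf /ebook
```

Make sure the temp directory has enough free space; in the example above the intermediate file was more than twice the size of the input.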

Martijn de Milliano answered Sep 28 '22 08:09