
Remove multiple embedded fonts in a PDF created with pdftk

Is there a way to remove fonts that are embedded multiple times in a pdf file?

This is my scenario:

1) a program generates several one-page pdf reports (querying a db, putting the info on an excel template and exporting the formatted information in pdf)

2) pdftk merges the single-page pdfs into one file.

Everything works fine, but the resulting pdf is very large: I noticed that the fonts are embedded multiple times (as many times as there are pages: all pages are generated from the same excel template, the fonts are embedded in each single-page pdf file, and pdftk just glues the pdfs together). Is there a way to keep just one copy of each embedded font?

I tried to embed the fonts just in the first page while exporting from excel->pdf: the size of the file decreases dramatically, but it seems that the other pages can't access the embedded fonts.

Thanks, Alessandro

AleV asked May 16 '12 21:05


2 Answers

You could try to 'repair' your pdftk-concatenated PDF using Ghostscript (but use a recent version, such as 9.05). In many cases Ghostscript will be able to merge the many subsetted fonts into fewer ones.

The command would look like this:

gswin32c.exe ^
    -o output.pdf ^
    -sDEVICE=pdfwrite ^
    -dPDFSETTINGS=/prepress ^
     input.pdf

Check with

pdffonts.exe  output.pdf
pdffonts.exe  input.pdf 

how many instances of the various font subsets each file contains (pdffonts.exe is available as part of a small package of command-line tools).
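If the pdffonts listings get long, counting the entries by hand is tedious. Here is a minimal Python sketch that tallies font entries per base name after stripping the subset tag (six uppercase letters plus "+"). The function name `count_base_fonts` and the sample listing are made up for illustration; the sample only mimics the column layout pdffonts prints, and real input would be the captured output of the two commands above.

```python
import re
from collections import Counter

# Subset fonts carry a tag of six uppercase letters plus "+", e.g. "ABCDEF+Arial".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def count_base_fonts(pdffonts_output):
    """Count font entries per base name, ignoring the subset tag."""
    counts = Counter()
    seen_rule = False
    for line in pdffonts_output.splitlines():
        if not seen_rule:
            # Everything up to the dashed rule is header text.
            seen_rule = line.startswith("---")
            continue
        if line.strip():
            name = line.split()[0]  # first column is the font name
            counts[SUBSET_TAG.sub("", name)] += 1
    return counts


# Made-up sample in the layout pdffonts prints (assumption: not real data).
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Arial                         TrueType          WinAnsi          yes yes no       8  0
GHIJKL+Arial                         TrueType          WinAnsi          yes yes no      15  0
MNOPQR+Arial-Bold                    TrueType          WinAnsi          yes yes no      22  0
"""

for font, n in count_base_fonts(SAMPLE).most_common():
    print(f"{font}: {n} embedded instance(s)")
```

Running it on the listing for input.pdf and again on the listing for output.pdf should show whether Ghostscript reduced the number of instances per font.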

But don't complain about the 'slow speed' of this process: Ghostscript fully interprets all PDF input files to accomplish its task, while the pdftk file concatenation is a much simpler process...


Update:

Instead of pdftk you could use Ghostscript to merge your input PDF files. This could possibly avoid the problem you were seeing with the a posteriori Ghostscript 'repair' of your pdftk-merged files. Note that this will be much slower than the 'dumb' pdftk merge. However, the results may suit you better, especially regarding font handling and file size.

This would be a possible command:

gswin32c.exe ^
    -o output.pdf ^
    -sDEVICE=pdfwrite ^
    -dPDFSETTINGS=/prepress ^
     input1.pdf ^
     input2.pdf ^
     input3.pdf

You can add more options to the Ghostscript command line for more fine-grained control over the merge and optimization process.

In the end you'll have to decide between the extremes:

  • 'Fast' pdftk producing large output files, vs.
  • 'Slow' gswin32c.exe (Ghostscript) producing lean output files.

I'd be interested if you would post some results (execution time and resulting file sizes) for both methods for a number of your merge processes...
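To collect those numbers, something like the following Python sketch might do. It builds the Ghostscript call with the same options as above and a plain `pdftk ... cat output ...` call, runs each one, and reports elapsed time and output size. All function names and the output file names (merged_pdftk.pdf, merged_gs.pdf) are assumptions for illustration, and both executables are assumed to be on the PATH.

```python
import os
import subprocess
import time


def build_gs_merge_command(inputs, output, gs_exe="gswin32c.exe"):
    """Ghostscript call from the answer, with every input file listed in order."""
    return [gs_exe, "-o", output, "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=/prepress", *inputs]


def build_pdftk_merge_command(inputs, output, pdftk_exe="pdftk"):
    """Plain pdftk concatenation: pdftk in1.pdf in2.pdf cat output out.pdf"""
    return [pdftk_exe, *inputs, "cat", "output", output]


def timed_merge(command, output):
    """Run one merge; return (seconds elapsed, output size in bytes)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)  # raises if the tool exits with an error
    return time.perf_counter() - start, os.path.getsize(output)


def compare_merges(inputs):
    """Merge the same inputs with both tools and print time and size for each."""
    outputs = {"pdftk": "merged_pdftk.pdf", "ghostscript": "merged_gs.pdf"}
    commands = {
        "pdftk": build_pdftk_merge_command(inputs, outputs["pdftk"]),
        "ghostscript": build_gs_merge_command(inputs, outputs["ghostscript"]),
    }
    for tool, command in commands.items():
        seconds, size = timed_merge(command, outputs[tool])
        print(f"{tool}: {seconds:.1f} s, {size} bytes")
```

Calling `compare_merges` with the list of your single-page report files would run both merges on the same inputs, so the two result lines are directly comparable.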


Update 2: Sorry, my previous version contained a typo.
It's not -sPDFSETTINGS=... but it must be -dPDFSETTINGS=... (d in place of s).


Update 3:

Since your source files are Excel sheets made from templates (which usually don't use a lot of different fonts), you could try to use a trick to make sure Ghostscript has all the required glyphs of the fonts used in all to-be-merged-later PDFs:

  • For each font and face (standard, italic, bold, bold-italic) add a table cell into your template sheet at the top left of your print area.
  • Fill this table cell with all printable characters and punctuation signs from the ASCII alphabet: 0123456789, ABCD...XYZ, abc...xyz, :-_;°%&$§")({}[] etc.
  • Make the cell (and the fontsize) as small as you want or need in order to not disturb your overall layout. Use the color white to format the characters in the cell (so they appear invisible in the final PDF).

This method will hopefully make sure that each of your PDFs uses the same subset of glyphs, which would then avoid the problems you observed when merging the files with Ghostscript. (Note that if you use, for example, Arial and Arial-Italic, you have to create 2 such cells: one formatted with the standard Arial typeface, the other one with the italic one.)
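Typing the filler text by hand is error-prone, so here is a small Python sketch that generates it. It returns every printable ASCII character except the space (codes 0x21 to 0x7E), followed by any extra characters you pass in. The function name `glyph_coverage_text` and the `extra` parameter are made up for illustration; the default extras are the two non-ASCII signs from the list above.

```python
def glyph_coverage_text(extra="°§"):
    """Return all printable ASCII characters (no space) plus optional extras."""
    # 0x21 is "!" and 0x7E is "~"; the range covers digits, letters and punctuation.
    ascii_part = "".join(chr(code) for code in range(0x21, 0x7F))
    return ascii_part + extra


# Paste the printed string into one template cell per font face.
print(glyph_coverage_text())
```

If your reports contain accented letters or other characters outside ASCII, they would need to go into `extra` as well, otherwise the subsets could still differ from page to page.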

Kurt Pfeifle answered Nov 08 '22 09:11


Fonts are usually subset when creating PDF files, so that they only contain the required glyphs. In addition, the encoding is altered so that the first glyph used is assigned character code 1, the second is 2 and so on.

As a result, the first PDF file might contain a font where 0x01 = A, 0x02 = space, 0x03 = t, 0x04 = e and 0x05 = s. The second file might contain a font where 0x01 = T, 0x02 = e, 0x03 = s and 0x04 = t.

In order not to get confused, a prefix is added to the name of the font in the document. This prefix is stripped out by Acrobat when displaying the font embedding, so it seems like you have multiple instances of the same font. However, they are in fact different fonts and cannot readily be combined.
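A toy Python model of that numbering may make the conflict easier to see. It assigns codes in order of first use, as described above, for the two example texts "A test" and "Test", then lists the codes that mean different glyphs in the two files. This is a simplified illustration only, not how any particular PDF producer actually builds its subsets, and the function names are made up.

```python
def subset_encoding(text):
    """Assign character codes 1, 2, 3, ... in order of first use."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    # Invert to code -> character, the direction a PDF viewer looks things up.
    return {code: ch for ch, code in code_for.items()}


def conflicting_codes(first, second):
    """Codes that both subsets define, but for different glyphs."""
    return sorted(c for c in first.keys() & second.keys() if first[c] != second[c])


first = subset_encoding("A test")   # page 1 uses: A, space, t, e, s
second = subset_encoding("Test")    # page 2 uses: T, e, s, t

for code in conflicting_codes(first, second):
    print(f"0x{code:02X}: {first[code]!r} in file 1, {second[code]!r} in file 2")
```

Every code the two subsets share points to a different glyph, which is why a merge tool cannot simply keep one of the two fonts and drop the other.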

Assuming this is the case (and I would need to see your files to be sure), it 'may' be possible to avoid this. If you set the PDF-producing software so that it does not subset fonts, then pdftk might be able to merge the documents without including the same font multiple times. I obviously haven't tested this, but it might work. Your other option is to modify your workflow so that the reports are produced as multi-page documents in the first place.

KenS answered Nov 08 '22 08:11