 

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.

Intermittently, some characters are lost during the merge and replaced by nothing (or a space) in the merged PDF. The original PDFs look fine, but after the merge some characters are missing.

Note that a given character, such as the digit 9 or the letter a, can be missing in one place yet show up fine elsewhere in the same document, so this is not simply a display problem or a missing font.

The command I am using is:

gs \
   -q \
   -dNOPAUSE \
   -sDEVICE=pdfwrite \
   -sOutputFile=/tmp/outputfilename \
   -dBATCH \
    /var/www/documents/docs/input1.pdf \
    /var/www/documents/docs/input2.pdf \
    /var/www/documents/docs/input3.pdf 

Has anyone else experienced this or, even better, does anyone know a solution for it?

asked Oct 09 '12 19:10 by Mr R



2 Answers

I've seen this happen when the names of embedded font subsets are identical but the real content of these subsets is different (that is, they contain different glyph sets).

Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:

 # List the fonts of every input PDF and keep a copy of each listing.
 for i in input*.pdf; do
     pdffonts "${i}" | tee "${i}.pdffonts.txt"
 done

Look for the font names used in each PDF.

My bet is that you will see identical font names (names similar to BAAAAA+ArialMT) used by different input files.

The BAAAAA+ font name prefix used for subset fonts is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAA+, DAAAAA+, etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...

It can easily happen that your input files do not use the exact same subset of characters. However, the identical names could make Ghostscript think that the font really is the same. It (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which were part of the other instance(s).

This leads to some characters missing in merged output.
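
The check for identical subset names across input files can be scripted. The following is only a minimal sketch: the helper names subset_names and shared_names are made up for illustration, and it assumes that Poppler's pdffonts prints two header lines followed by one font per line with the name in the first column, and that font names contain no spaces.

```shell
#!/bin/sh
# Sketch: report subset font names that occur in more than one input PDF.
# Assumptions: pdffonts (Poppler) is installed; its output has two header
# lines; font names contain no spaces. Helper names are hypothetical.

# Read pdffonts output on stdin; print each subset font name once.
# A subset name starts with six capital letters followed by "+".
subset_names() {
    awk 'NR > 2 && $1 ~ /^[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][+]/ { print $1 }' | sort -u
}

# Read names collected from several files; print those seen more than once.
shared_names() {
    sort | uniq -d
}

# Usage: sh check-subsets.sh input1.pdf input2.pdf input3.pdf
if [ "$#" -gt 0 ]; then
    for f in "$@"; do
        pdffonts "$f" | subset_names
    done | shared_names
fi
```

Any name this prints is used as a subset name by at least two input files. That alone does not prove the glyph sets differ, but those are the fonts to look at first.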

I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll have more luck with Ghostscript v9.06 (the most recent release to date).

I'm very much interested in investigating this in more detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.71), I can test if it works better with v9.06.

What you could do to avoid this problem

  1. Try to always embed fonts as full sets, not subsets:

    • I don't know whether, or how, you can force full font embedding when using wkhtmltopdf.
    • If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
    • If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
    • If Ghostscript generates your input PDFs, the command-line parameters to enforce full font embedding are:
      gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file

    Some types of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

  2. Do the following only if you are sure that no one else gets to see, print or use your individual input files: do not embed the fonts in them at all, and embed fonts only when Ghostscript merges your inputs into the final PDF.

    • I don't know whether, or how, you can disable font embedding when using wkhtmltopdf.
    • If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
    • If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
    • If Ghostscript generates your input PDFs, the command-line parameters to prevent font embedding are:
      gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" -f input.file

    Some types of fonts cannot be embedded fully, but only subsetted (Type3, CIDFontType1). See this answer to question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

  3. Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
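
For option 3, a merge with pdftk uses its cat operation. The wrapper below is only a sketch: the function name merge_with_pdftk and the DRY_RUN variable are invented for illustration, and the file names are placeholders taken from the question.

```shell
#!/bin/sh
# Sketch: merge PDFs with pdftk instead of Ghostscript.
# merge_with_pdftk and DRY_RUN are hypothetical names, not part of pdftk.

# Usage: merge_with_pdftk OUTPUT INPUT...
# With DRY_RUN set, the pdftk command is printed instead of executed.
merge_with_pdftk() {
    out=$1
    shift
    ${DRY_RUN:+echo} pdftk "$@" cat output "$out"
}

# Example (commented out so the file can be sourced safely):
# merge_with_pdftk /tmp/outputfilename input1.pdf input2.pdf input3.pdf
```

Running it once with DRY_RUN=1 shows the exact command before anything is written.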


Update

To answer once more, but this time more explicitly (following the extra question of @sacohe in the comments below): in many (not all) cases the following procedure will work:

  • Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).

  • The command to use is this (or similar):
    gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf

The resulting output PDF should then be using different (unique) prefixes to the font names, even when the input PDF used the same name prefix for different font (subsets).

This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
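
Put together, the re-distill step and the merge could look roughly like the sketch below. It is not the exact procedure used on the sample files: the helper names redistilled_path and redistill_and_merge are made up, mktemp -d is assumed to be available, and the copies are numbered so that the shell glob keeps the original input order.

```shell
#!/bin/sh
# Sketch: re-distill every input with pdfwrite, then merge the copies.
# redistilled_path and redistill_and_merge are hypothetical helper names.

# Path of the n-th re-distilled copy inside a work directory.
# Zero-padding keeps the glob order identical to the input order.
redistilled_path() {
    printf '%s/%04d.pdf\n' "$1" "$2"
}

# Usage: redistill_and_merge OUTPUT INPUT...
redistill_and_merge() {
    merged=$1
    shift
    workdir=$(mktemp -d) || return 1
    n=0
    status=0
    for src in "$@"; do
        n=$((n + 1))
        # Step 1: rewrite each input so that pdfwrite assigns new subset prefixes.
        gs -q -o "$(redistilled_path "$workdir" "$n")" -sDEVICE=pdfwrite "$src" || { status=1; break; }
    done
    if [ "$status" -eq 0 ]; then
        # Step 2: merge the re-distilled copies in their original order.
        gs -q -o "$merged" -sDEVICE=pdfwrite "$workdir"/*.pdf || status=1
    fi
    rm -rf "$workdir"
    return "$status"
}

# Example (commented out so the file can be sourced safely):
# redistill_and_merge /tmp/outputfilename input1.pdf input2.pdf input3.pdf
```

Whether the new prefixes really differ depends on the Ghostscript version, so it is worth running pdffonts on the re-distilled copies before trusting the merged result.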

answered Nov 28 '22 13:11 by Kurt Pfeifle


I wanted to give some feedback: unfortunately, the re-processing trick doesn't seem to work with Ghostscript 8.70 (in Red Hat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything). I also haven't been able to find any pre-built versions of Ghostscript 9 for my platform.

You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to core dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, what version do you recommend?

Another ugly method that halfway works is to use:

-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false

which converts the fonts to bitmaps. The characters on the page then look a bit light (not a big deal), text selection is off by about one line height (mildly annoying), and, worst of all, even though the characters display OK, copy/paste gives random garbage.

(I was hoping this would be a comment, but I guess I can't do that, is answer closed?)

answered Nov 28 '22 13:11 by q7joey