Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.

Occasionally, and seemingly at random, some characters get lost in the merge and are replaced by nothing (or by a space) in the merged PDF. The original PDFs look fine, but after the merge some characters are missing.

Note that a character such as the digit 9 or the letter a can be missing in one place in the document yet show up fine elsewhere, so the problem is not that the character cannot be displayed, nor a font issue as such.

The command I am using is:

gs \
   -q \
   -dNOPAUSE \
   -sDEVICE=pdfwrite \
   -sOutputFile=/tmp/outputfilename \
   -dBATCH \
    /var/www/documents/docs/input1.pdf \
    /var/www/documents/docs/input2.pdf \
    /var/www/documents/docs/input3.pdf

Has anyone else experienced this, or even better, does anyone know a solution for it?

asked Oct 09 '12 19:10

Mr R

2 Answers

I've seen this happen when the names of embedded font subsets are identical, but the real content of these subsets is different (containing different glyph sets).

Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:

 for i in input*.pdf; do
     pdffonts "${i}" | tee "${i}.pdffonts.txt"
 done

Look for the font names used in each PDF.

My bet is that you will see identical font names (similar to BAAAAA+ArialMT) used by different input files.
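To spot such collisions without reading every report by eye, something like the following could work. It is only a sketch: it assumes the .pdffonts.txt reports written by the loop above, that pdffonts prints two header lines before the font list, and that font names contain no spaces. The function name shared_subset_names is made up for illustration.

```shell
# Hypothetical helper: print every subset font name (six capital letters
# followed by "+") that occurs in more than one pdffonts report.
# Assumes two header lines per report and font names without spaces.
shared_subset_names() {
    for report in "$@"; do
        # Field 1 is the font name; list each name once per report.
        LC_ALL=C awk 'NR > 2 && $1 ~ /^[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z][+]/ { print $1 }' "$report" | sort -u
    done | sort | uniq -d   # names seen in at least two reports
}
```

It could be called as: shared_subset_names input*.pdf.pdffonts.txt. A shared name alone does not prove that the glyph sets differ, but it tells you which fonts to look at first.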

The prefix for subset font names (such as BAAAAA+) is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAA+, DAAAAA+, etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
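As a rough heuristic, a font name can be checked for a prefix of this predictable kind. The sketch below only recognizes the BAAAAA+ / CAAAAA+ pattern described above (one capital letter followed by five A's); other generators may be predictable in other ways, and the function name has_sequential_prefix is invented for this example.

```shell
# Hypothetical check: succeed if the font name starts with a prefix such as
# BAAAAA+ or CAAAAA+ (one capital letter, then five A's, then "+").
has_sequential_prefix() {
    case $1 in
        [A-Z]AAAAA+*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, has_sequential_prefix BAAAAA+ArialMT succeeds, while a name with no subset prefix at all, such as ArialMT, fails the check.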

It can easily happen that your input files do not use the exact same subset of characters. However, the identical names could make Ghostscript think that the font really is the same. It then (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). This instance may lack some glyphs which were part of the other instance(s).

This leads to some characters missing in merged output.
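The effect can be illustrated with a toy model. To be clear, this is not Ghostscript's code, only a sketch of the "first font with a given name wins" behaviour suspected above. Each input line is assumed to hold a font name and the glyphs its subset contains; the function name lost_glyphs and the input format are invented for this example.

```shell
# Toy model only, NOT Ghostscript's implementation. Input lines look like
# "<font name> <glyphs>". The first subset seen under a name is kept; glyphs
# that only later subsets with the same name contain are reported as lost.
lost_glyphs() {
    awk '
        !($1 in kept) { kept[$1] = $2; next }   # first instance wins
        {
            n = length($2)
            for (i = 1; i <= n; i++) {
                g = substr($2, i, 1)
                if (index(kept[$1], g) == 0) print $1, g
            }
        }
    '
}
```

Fed with two subsets that are both named BAAAAA+Arial, one containing "abc" and the other "ab9", the model reports the 9 as lost, which matches the kind of gap seen in the merged file.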

I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll have better luck with Ghostscript v9.06 (the most recent release to date).

I'm very much interested in investigating this in greater detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.71), I can test whether it works better with v9.06.

What you could do to avoid this problem

1. Try to always embed fonts as full sets, not subsets:
   - I don't know if and how you can enforce full font embedding when using wkhtmltopdf.
   - If you generate your input PDFs from Libre/OpenOffice, you're out of luck: you have no control over it.
   - If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
   - If Ghostscript generates your input PDFs, the command-line parameters to enforce full font embedding are:
       gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
   Some types of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to the question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

2. Do the following only if you are sure that no one else gets to see, print or use your individual input files: do not embed the fonts at all, and embed them only when Ghostscript merges your inputs into the final PDF.
   - I don't know if and how you can prevent font embedding when using wkhtmltopdf.
   - If you generate your input PDFs from Libre/OpenOffice, you're out of luck: you have no control over it.
   - If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
   - If Ghostscript generates your input PDFs, the command-line parameters to prevent font embedding are (the -f ends the PostScript passed via -c, so that input.file is read as a file name):
       gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" -f input.file
   Some types of fonts cannot be embedded fully, but only subsetted (Type3, CIDFontType1). See this answer to the question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.

3. Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
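For the pdftk route, the merge uses pdftk's cat operation. The wrapper below is a minimal sketch: the function name merge_with_pdftk and the PDFTK override variable are assumptions made for this example (the variable only exists so that the command can be dry-run or pointed at a specific pdftk build).

```shell
# Hypothetical wrapper: merge_with_pdftk OUTPUT INPUT...
# Set PDFTK to use another binary; the default is "pdftk".
merge_with_pdftk() {
    out=$1
    shift
    # pdftk syntax: <inputs> cat output <result>
    "${PDFTK:-pdftk}" "$@" cat output "$out"
}
```

With the files from the question, the call would be: merge_with_pdftk /tmp/outputfilename /var/www/documents/docs/input1.pdf /var/www/documents/docs/input2.pdf /var/www/documents/docs/input3.pdf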

Update

To answer once more, but this time more explicitly (following the extra question of @sacohe in the comments below): In many (not all) cases the following procedure will work.

Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf

The resulting output PDF should then use different (unique) prefixes for the font names, even when the input PDFs used the same name prefix for different fonts (subsets).

This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
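Applied to several inputs, the procedure might be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe for every Ghostscript version: the function name redistill_and_merge, the GS override variable and the numbered file names in the work directory are all invented for this example, and the work directory must already exist.

```shell
# Hypothetical helper: redistill_and_merge OUTPUT WORKDIR INPUT...
# Set GS to override the Ghostscript binary (default: gs).
redistill_and_merge() {
    out=$1
    workdir=$2
    shift 2
    n=0
    for f in "$@"; do
        n=$((n + 1))
        # Re-distill one input, using the same options as the command above.
        "${GS:-gs}" -o "$workdir/$n.pdf" -sDEVICE=pdfwrite "$f" || return 1
        # Rotate the argument list: append the new copy, drop the original.
        set -- "$@" "$workdir/$n.pdf"
        shift
    done
    # "$@" now holds only the re-distilled copies, in input order.
    "${GS:-gs}" -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}
```

A call could look like: redistill_and_merge /tmp/outputfilename /tmp/redistilled /var/www/documents/docs/input*.pdf. It is worth running pdffonts on the re-distilled copies before trusting the merge, since the fix does not work in all cases.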

answered Nov 28 '22 13:11

Kurt Pfeifle

I wanted to give some feedback: unfortunately, the re-processing trick doesn't seem to work with Ghostscript 8.70 (in Red Hat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything). I also haven't been able to find any pre-built versions of Ghostscript 9 for my platform.

You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to core dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, which version do you recommend?

Another ugly method that halfway works is to use:

-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false

which converts the fonts to bitmaps. However, the characters on the page then look a bit light (not a big deal), text selection is off by about one line height (mildly annoying), and, worst of all, even though the characters display OK, copy/paste gives random garbage.

(I was hoping this would be a comment, but I guess I can't do that. Is the answer closed?)

answered Nov 28 '22 13:11

q7joey

Tags:

merge

pdf

ghostscript