PDF font names with spaces give rise to printer errors

Some background: I maintain an archive of largely unindexed scientific literature, and in that context I scan paper documents and run OCR on them to produce searchable text. This worked well until the university switched to printers without OCR capability, after which I had to do the scanning and the OCR as separate steps. For the OCR I chose Adobe Acrobat Pro. It seemed to work fine until one day I realized I couldn't print some of the documents I'd been working on (printing from Mac Preview, not Adobe Acrobat). The error message from the printer (Ricoh IM C4500) was:

  ERROR: undefined
  OFFENSIVE COMMAND: New
  STACK:
  /AAAAAC+*Times
  /FontName

My understanding of PDF is limited, but by first printing to PS (still from Preview) and then using Adobe Distiller to regenerate the PDF, I was able to reproduce the error. That led me to replace all font names "with spaces" by font names "with dashes", like this (in the PS):

  Times New Roman -> Times-New-Roman

That made Adobe Distiller happy, and the regenerated PDF printed without issue. I then tried the same round trip with pdf2ps and ps2pdf. Interestingly, these two programs together fixed the problem without any manual intervention like the one above.

At this point I should have an MWE to show you, but I'm not sure how to make one with PDF in the picture. Also, I think the cause of the problem is clear enough. My two questions are:

  1. How to fix the files that are already affected?
  2. How to avoid the font name problem in the future?

The archive holds many files, and I can't see a viable solution that doesn't use the command line, which is fine by me. For example, I can run pdffonts, as discussed in "How to find out which fonts are referenced and which are embedded in a PDF document", to get a list of the fonts used in a PDF. But how do I go on to "edit" the PDF if it has font names with spaces? I assume the file needs to be rebuilt somehow, but here I really need some advice. To me it seems that Ghostscript, for example, would be an ideal candidate for cleaning PDF files at this level, but that might be naive.
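
As a rough sketch of the detection step, the filter below reads pdffonts output and prints the font names that contain a space. The helper name names_with_spaces and the sample table are made up for illustration, and the pattern assumes the type column starts with "Type <digit>", "TrueType", or one of those preceded by "CID"; a font name that itself contains such text would be cut short.

```shell
# names_with_spaces (hypothetical helper): read pdffonts output on stdin and
# print the unique font names that contain a space.
names_with_spaces() {
  sed -n '3,$p' |      # drop the two header lines
    sed -E 's/[ ]+(CID[ ]+)?(TrueType|Type[ ]+[0-9]).*$//' |  # cut from the type column on
    grep ' ' |         # keep only names that still contain a space
    sort -u
}

# Made-up sample in the layout pdffonts prints. With a real file this would be:
#   pdffonts file.pdf | names_with_spaces
names_with_spaces <<'EOF'
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
AAAAAC+Times New Roman               TrueType          WinAnsi          yes yes no      12  0
AAAAAD+Helvetica                     Type 1            WinAnsi          yes yes no      15  0
EOF
```

On the sample, only the first font is reported, because the second name has no space once the type column is removed.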

Tore H-W asked Dec 21 '25 06:12


1 Answer

I hope it is OK that I answer my own question. If not, please tell me how to proceed...

Following the advice of @K J and @johnwhitington, I ended up writing a Bash script based on the combined use of pdffonts, qpdf, xxd and gs. The idea is to generate an editable PDF (qpdf), transform it into hexadecimal text (xxd), do the same with the font name patterns, perform straightforward sed substitutions, convert the result back to the qpdf format, and finally clean up the PDF using gs. I have only a marginal understanding of what goes on behind the scenes here, but I hope that someone will explain it in this post.
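
To make the hex substitution idea concrete, here is a minimal sketch on a single string instead of a whole PDF. The helper to_hex is hypothetical, and it uses od and tr in place of xxd so that it needs only POSIX tools; xxd -p -u should give the same digits. The PostScript-like fragment is only modeled on the printer's error stack, not taken from a real file.

```shell
# to_hex (hypothetical helper): uppercase hex of its argument, no separators.
to_hex() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F'
}

name='Times New Roman'
fixed=$(printf '%s' "$name" | sed 's/ /-/g')    # Times-New-Roman

# Illustrative fragment (an assumption, loosely based on the error stack).
stream=$(to_hex '/FontName /AAAAAC+Times New Roman def')

# Same-length replacement: each 20 (space) inside the name becomes 2D (dash).
patched=$(printf '%s' "$stream" | sed "s/$(to_hex "$name")/$(to_hex "$fixed")/g")

to_hex "$name";  echo    # 54696D6573204E657720526F6D616E
to_hex "$fixed"; echo    # 54696D65732D4E65772D526F6D616E
echo "$patched"
```

Because a space and a dash are both one byte, the patched text has the same length as the original, which is presumably why the file survives the edit well enough for gs to rebuild it afterwards.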

The code shown below answers my question no. 1. As for question 2, I see no alternative to running the same script each time I do OCR in Adobe Acrobat Pro. An alternative to Acrobat would be great, but I haven't seen many (and I do not have the time to train Tesseract myself). The script has been tested on about 1000 files from about 100 different creators so far and seems to be stable except for one known bug, namely mixed use of literal space ' ' and hexadecimal space #20 in the PDF (which I haven't observed so far).
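
One conceivable way around the mixed-spelling limitation, offered as a sketch that has not been tried on real PDFs and is not part of the script: list every spelling of a name in which each space is written either as ' ' or as '#20', and run the same substitution once per spelling. The helper name variants is hypothetical.

```shell
# variants (hypothetical helper): print every spelling of the font name "$1"
# in which each space is written either as ' ' or as '#20'.
variants() {
  case $1 in
    *' '*) ;;                           # at least one space left: split below
    *) printf '%s\n' "$1"; return ;;    # no spaces: only one spelling
  esac
  local head="${1%% *}" rest="${1#* }"
  variants "$rest" | while IFS= read -r tail; do
    printf '%s %s\n'   "$head" "$tail"
    printf '%s#20%s\n' "$head" "$tail"
  done
}

# Each spelling maps to a same-length replacement: ' ' -> '-', '#20' -> '#2D'.
variants 'Times New Roman' | while IFS= read -r spelling; do
  printf '%s -> %s\n' "$spelling" \
    "$(printf '%s' "$spelling" | sed -e 's/#20/#2D/g' -e 's/ /-/g')"
done
```

A name with k spaces yields 2^k spellings, so this stays cheap for ordinary font names but multiplies the number of sed passes over the hex text.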

#!/bin/bash
#script tested on GNU bash, version 5.2.21(1)
#
# ------------------------------------------------------------------------------
# Purpose: Use output from "pdffonts" to patch up any font names "with spaces"
#          that occur in a PDF file. Such files have been observed to crash the
#          (PostScript) printer when printed from Mac Preview while working OK
#          when printed from Adobe Acrobat. The spaces can be encoded either as
#          literal ' ' or as hexadecimal #20 (but not mixed usage). The two
#          forms are replaced by '-' and #2D respectively in the patched
#          PDF file, which is generated by Ghostscript. No further changes are
#          made to PDF, but other hidden issues/warnings/errors with the fonts 
#          may also come to light using "pdffonts", so watch out for any extra 
#          output from stderr. 
#
#          The heuristic of the font name patch works along these lines:
#
#          1. Use pdffonts to make list of font names "with spaces"
#          2. Run qpdf on PDF to make qpdf file format
#          3. Run xxd on 2. to make hexadecimal text without (extra) newlines
#          4. Run xxd on 1. to transform into hexadecimal search patterns
#          5. Use sed to substitute patterns 4. in file 3.
#          6. Run xxd to transform 5. back to qpdf format
#          7. Run gs on 6. to make patched pdf 
#
# Names  : SRCDIR (source directory for the PDFs)
#          OUTDIR (output work directory)
#          PDF    (running pdf filename, traversed in depth by "find")
#          JOB    (stem of output filenames in OUTDIR, see next item below)
#          FONTS  (unique list of font names "with spaces")
#          PAT    (element in FONTS, iterator)
#          PATx   (transformed hexadecimal PATs for specific use x, see code)
#
# Output : After a successful font name patch there will be a number of output
#          files stored in OUTDIR:
#
#          JOB.qdf.pdf     (output from qpdf)
#          JOB.xxd.qdf.pdf (output from xxd)
#          JOB.xxd.txt     (same as JOB.xxd.qdf.pdf, but without newlines)
#          JOB.gs.qdf.pdf  (output from gs, the patched PDF)
#          JOB.log         (log-file)
#
# Author : Tore Haug-Warberg
# Since  : 2024-01-17
# Note   : The regex used to isolate the font names is tailored to pdffonts 
#          v3.03 which states there are seven different kinds of fonts:
#          - Type 1
#          - Type 1C - aka Compact Font Format (CFF)
#          - Type 3
#          - TrueType
#          - CID Type 0 - 16-bit font with no specified type
#          - CID Type 0C - 16-bit PostScript CFF font
#          - CID TrueType - 16-bit TrueType font
#          This info is encoded as (CID)?[ ](True|Type).*$ in the code below
# Usage  : Change the directories SRCDIR and OUTDIR; adapt the command
#          'find -s "$SRCDIR" -iname ...' to your needs; run script.
#          'find -s' and "sed -i ''" are the BSD/macOS forms of these commands
# ------------------------------------------------------------------------------

SRCDIR=~/Foo/
OUTDIR=~/Bar/

find -s "$SRCDIR" -iname "*.pdf" -not -iname "*_orig.pdf" | \
while IFS= read -r PDF; 
do 
  JOB="$OUTDIR/$(basename "$PDF" '.pdf')"; 
  echo "$PDF"; 

  # Scan output from pdffonts looking for font names "with spaces". There is no
  # grammar for this and the sed pattern used below will sometimes fail: only
  # the part of the font name which consists of alphanumeric text (plus space)
  # is kept by sed | sort | uniq (to speed up the text processing).
  # However, if the sed pattern fails, all font names containing spaces will
  # still undergo space substitution; it just takes more time
  IFS=$'\n' \
  FONTS=$(pdffonts "$PDF" | \
          sed -n '3,$p' | \
          sed -E 's/^(.*)([ ]+CID[ ]+(True|Type).*$)/\1/' | \
          sed -E 's/^(.*)([ ]+(True|Type).*$)/\1/' | \
          sed -e 's/[ ]*$//' | \
          grep -e '[ ]' | \
          sed -E 's/^[^[:alnum:]]*([[:alnum:]]+[ ][[:alnum:]\ ]+).*$/\1/' | \
          sort | uniq);

  # Skip this file if the list holds no font names "with spaces"
  if [[ "" == "$FONTS" ]];
  then
    continue
  else
    echo "$PDF" > "$JOB".log; 
    echo "$FONTS" >> "$JOB".log; 
  fi

  # Transform PDF, first to qpdf-format and then to hexadecimal text with no
  # (extra) newlines, so that we can run 'sed' on the entire shebang without
  # knowing the qpdf file structure. This only works for PDFs of modest size,
  # but at least a few tens of MB works fine
  qpdf --qdf "$PDF" "$JOB".qdf.pdf;
  xxd -p -u "$JOB".qdf.pdf | tr -d ' \n' > "$JOB".xxd.txt;

  # Transform PAT into hexadecimal search patterns PATa and PATb (for font names
  # spelled with literal space ' ') and patterns PATc and PATd (for font names
  # spelled with hexadecimal #20). Both alternatives must be tested because
  # pdffonts outputs literal space even if #20 is used in PDF. The simple test
  # therefore only works if ' ' and #20 are not mixed within one occurrence of
  # a font name; an occurrence that mixes them matches neither pattern and is
  # left unchanged.
  for PAT in $FONTS;    
  do 
    PATa=$(echo "$PAT"                        | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATb=$(echo "$PAT" | sed -e 's/[ ]/-/g'   | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATc=$(echo "$PAT" | sed -e 's/[ ]/#20/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATd=$(echo "$PAT" | sed -e 's/[ ]/#2D/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    sed -i '' -e "s/$PATa/$PATb/g" "$JOB".xxd.txt;
    sed -i '' -e "s/$PATc/$PATd/g" "$JOB".xxd.txt;
  done;

  # Transform from hexadecimal back to qpdf format
  xxd -ps -r "$JOB".xxd.txt "$JOB".xxd.qdf.pdf;

  # Create a new pdf with the same timestamp as the PDF but now patched so that 
  # there are no font names "with spaces"
  gs -o "$JOB".gs.qdf.pdf \
     -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/default \
     "$JOB".xxd.qdf.pdf >> "$JOB".log;
  touch -r "$PDF" "$JOB".gs.qdf.pdf;

  # Log result
  pdffonts "$JOB".gs.qdf.pdf >> "$JOB".log;
  echo '--- qpdf+xxd+gs (end)'; 
done
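
A possible caveat with substituting in unseparated hex text, shown here as a sketch on made-up bytes rather than an observed failure: a pattern can also match at an odd offset, straddling byte boundaries, and then corrupt unrelated data. Keeping one space between bytes on both sides would rule that out. The helper to_hex_bytes and the sample bytes are hypothetical, and od is again used in place of xxd.

```shell
# to_hex_bytes (hypothetical helper): uppercase hex of stdin with one space
# between bytes, e.g. "41 20 42", so that patterns can only match whole bytes.
to_hex_bytes() {
  od -An -v -tx1 | tr -s ' \n' ' ' | sed -e 's/^ //' -e 's/ $//' | tr 'a-f' 'A-F'
}

# Four arbitrary bytes (0x34 0x12 0x04 0x21); the name "A B" is 41 20 42.
data() { printf '\064\022\004\041'; }

# Unseparated hex, as in the script: "412042" is found at an odd offset.
naive=$(data | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F' | sed 's/412042/412D42/g')

# Byte-separated hex: the same bytes contain no aligned "41 20 42".
aligned=$(data | to_hex_bytes | sed 's/41 20 42/41 2D 42/g')

echo "naive:   $naive"      # 3412D421, i.e. byte 0x04 has become 0xD4
echo "aligned: $aligned"    # 34 12 04 21, unchanged
```

According to its man page, xxd -r -p allows additional whitespace in its input, so the separated text should convert back the same way, at the cost of a file that is half again as large.
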
Tore H-W answered Dec 23 '25 22:12