Handling (remapping) missing/problematic (CID/CJK) fonts in PDF with ghostscript?

In brief, I'm dealing with a problematic PDF, which:

  • Cannot be fully rendered in a document viewer like evince, because of missing font information;
  • However - ghostscript can fully render the same PDF.

Thus -- regardless of what ghostscript uses to fill in the blanks (maybe fallback glyphs, or a different method of accessing fonts) -- I'd like to be able to use ghostscript to produce ("distill") an output PDF, where pretty much nothing will be changed, except font information added, so evince can render the same document in the same manner as ghostscript can.

My question is thus - is this possible to do at all; and if so, what would the command line be to achieve something like that?

Many thanks in advance for any answers,
Cheers!


Details:

I'm actually on an older Ubuntu 10.04, and I might be experiencing - not a bug - but an installation problem with evince (lack of poppler-data package), as noted in Bug #386008 “Some fonts fail to display due to “Unknown font tag...” : Bugs : “poppler” package : Ubuntu.
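If the missing poppler-data package is indeed the cause, one quick check is whether a CID-to-Unicode table exists for the collection named in the error message. This is only a minimal sketch: the function name `has_cid_to_unicode` is made up for illustration, and the default directory `/usr/share/poppler` is an assumption about where the Debian/Ubuntu poppler-data package puts its files.

```shell
# Hypothetical helper: succeeds if a cidToUnicode table for the given
# character collection exists under a poppler data directory.
#   $1 = collection name (e.g. Adobe-Japan1)
#   $2 = data directory (assumed default: /usr/share/poppler)
has_cid_to_unicode() {
    [ -f "${2:-/usr/share/poppler}/cidToUnicode/$1" ]
}

# Example use:
#   has_cid_to_unicode Adobe-Japan1 || echo "try: sudo apt-get install poppler-data"
```

If the check fails, installing poppler-data would be the first thing to try before any PDF rewriting.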

However, that is exactly what I'd like to handle, so I'll use the fontspec.pdf attached to that bug report ("PDF triggering the bug.") to demonstrate the problem.

evince

First, I open this pdf's page 3 in evince; and evince complains:

$ evince --page-label=3 fontspec.pdf

Error: Missing language pack for 'Adobe-Japan1' mapping
Error: Unknown font tag 'F5.1'
Error (7597): No font in show
Error: Unknown font tag 'F5.1'
Error (7630): No font in show
Error: Unknown font tag 'F5.1'
Error (7660): No font in show
Error: Unknown font tag 'F5.1'
...

The rendering looks like this:

evince-pdf-missfont-render.png

... and it is obvious that some font shapes are missing.
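Since the error log repeats the same message many times, it can help to reduce it to the distinct font tags involved. A small sketch, assuming the messages have exactly the form shown above; the function name `unknown_font_tags` is made up for illustration:

```shell
# Hypothetical helper: reads a viewer's error log on stdin and prints each
# distinct font tag that was reported as unknown.
unknown_font_tags() {
    sed -n "s/^Error: Unknown font tag '\(.*\)'\$/\1/p" | sort -u
}

# Example use (2>&1 so the messages are captured whichever stream they use):
#   evince --page-label=3 fontspec.pdf 2>&1 | unknown_font_tags
```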

Adobe acroread

Just a note on how Adobe's Acrobat Reader for Linux behaves; the following command line:

$ ./Adobe/Reader9/bin/acroread /a "page=3" fontspec.pdf

... generates no output to terminal whatsoever (for more on /a switch, see Man page acroread) -- and the program has absolutely no problem displaying the fonts.

Also, while I'd like to avoid the roundtrip to postscript, note that acroread itself can be used to convert a PDF to postscript:

$ ./Adobe/Reader9/bin/acroread -v
9.5.1

$ ./Adobe/Reader9/bin/acroread -toPostScript \
-rotateAndCenter -choosePaperByPDFPageSize \
-start 3 -end 3 \
-level3 -transQuality 5 \
-optimizeForSpeed -saveVM \
fontspec.pdf ./ 

Again, the above command line will generate no output to terminal; -optimizeForSpeed -saveVM are there because apparently they deal with fonts; the last argument ./ is the output directory (output file is automatically called fontspec.ps).

Now, evince can display the previously missing fonts in the fontspec.ps output - but again complains:

$ evince fontspec.ps 
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
GPL Ghostscript 9.02: Error: Font Renderer Plugin ( FreeType ) return code = -1
...

... and furthermore, all text seems to be flattened to curves in the postscript - so now one cannot select the text in the .ps file in evince anymore (note that the .ps file cannot be opened in acroread). However, one can convert this .ps back into .pdf again:

$ pstopdf fontspec.ps   # note, `pstopdf` has no output filename option;
                        # it will automatically choose 'fontspec.pdf',
                        # and overwrite previous 'fontspec.pdf' in 
                        # the same directory 

... and now the text in the output of pstopdf is selectable in evince, all fonts are there, and evince doesn't complain anymore. However, as I noted, I'd like to avoid roundtrip to postscript files altogether.

display (from imagemagick)

We can also observe the same page of the document with ImageMagick's display (note that image panning from the command line using 'display' is apparently still not available, so I've used -crop below to adjust the viewport):

$ display -density 150 -crop 740x450+280+200 fontspec.pdf[2]
   **** Warning: considering '0000000000 00000 n' as a free entry.
...
   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

... which generates some Ghostscript-style warnings - and results in something like this:

imagemagick-display-pdf.png

... where it's obvious that the missing fonts that evince couldn't render are now shown properly by ImageMagick's display.

ghostscript

Finally, we can use ghostscript as x11 viewer itself -- to observe the same page, same document:

$ gs -sDEVICE=x11 -g740x450 -r150x150 -dFirstPage=3 \
-c '<</PageOffset [-120 520]>> setpagedevice' \
-f fontspec.pdf

GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 74.
Page 3
>>showpage, press <return> to continue<<
^C

... and results in this output:

ghostscript-pdf-view.png

 

In conclusion: ghostscript (and apparently by extension, imagemagick) can seemingly find the missing font (or at least some replacement for it), and render a page with that -- even if evince fails at that for the same document.

I would, therefore, simply like to export a PDF version from ghostscript, that would have only the missing fonts embedded, and no other processing; so I try this:

$ gs -dBATCH -dNOPAUSE -dSAFER  \
-dEmbedAllFonts -dSubsetFonts=true -dMaxSubsetPct=99 \
-dAutoFilterMonoImages=false \
-dAutoFilterGrayImages=false \
-dAutoFilterColorImages=false \
-dDownsampleColorImages=false \
-dDownsampleGrayImages=false \
-dDownsampleMonoImages=false \
-sDEVICE=pdfwrite \
-dFirstPage=3 -dLastPage=3 \
-sOutputFile=mypg3out.pdf -f fontspec.pdf

GPL Ghostscript 9.02 (2011-03-30)
Copyright (C) 2010 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
   **** Warning: considering '0000000000 00000 n' as a free entry.
Processing pages 3 through 3.
Page 3

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> Mac OS X 10.5.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

... but it doesn't work - the output file mypg3out.pdf suffers from the exact same problems in evince as noted previously.

Note: While I'd like to avoid the postscript roundtrip, a good example of a gs command line for converting from pdf to ps with font embedding is here: (#277826) pdf - How to make GhostScript PS2PDF stop subsetting fonts; but the same command line switches for .pdf to .pdf do not seem to have any effect on the problem described above.

asked Jun 19 '12 at 00:06 by sdaau


2 Answers

OK point 1; you CANNOT use Ghostscript and pdfwrite to create a PDF file 'without any additional processing'.

The way that pdfwrite and Ghostscript work is to fully interpret the incoming data (PostScript, PDF, XPS, PCL, whatever), creating a series of graphics primitives, which are passed to the pdfwrite device. The pdfwrite device then reassembles these into a brand new PDF file.

So it's not possible to take a PDF file as input and manipulate it; it will always create a new file.

Now, I would suggest that you upgrade your 9.02 Ghostscript to 9.05 to start with. Missing CIDFonts are much better handled in 9.05 (and will be further improved in 9.06 later this year). (The font you are missing 'Osaka Mono' is in fact a CIDFont, not a regular font)

Using the current bleeding edge Ghostscript code produces a PDF file for me which has the missing font embedded. I can't tell if this will work for you because my copy of evince renders the original file perfectly well.

Added later

Examining the original PDF file I see that the fonts there are indeed embedded (as I would expect, since they are subsets). So in fact as you say in your own answer above, the problem is not font embedding, but the use of CIDFonts.

My answer here will not help you, as pdfwrite will still produce a CIDFont in the output. Basically this is a flaw in your version or installation of evince.

The problem with 'remapping' the characters is that a regular font's Encoding can address at most 256 glyphs, while a CIDFont has effectively no limit. So there is no way to put a CIDFont into a single Font. The only way to do this would be to create multiple Fonts, each of which contained a portion of the original, and then switch between them as required. Slow and clunky.
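To give a sense of scale for the splitting approach: a rough sketch of the arithmetic, assuming each regular font can expose at most 256 glyphs (one per single-byte character code). The function name `fonts_needed` and the sample glyph count are made up for illustration; they are not taken from fontspec.pdf.

```shell
# Hypothetical helper: number of 256-slot fonts needed to hold N glyphs,
# i.e. ceil(N / 256) computed with integer arithmetic.
fonts_needed() {
    echo $(( ($1 + 255) / 256 ))
}

# Example use:
#   fonts_needed 1000    # a CIDFont subset with 1000 glyphs
```

Each of those fonts would also need its own Encoding, and the page content would have to switch fonts whenever consecutive glyphs fall into different chunks.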

If you convert to PostScript using the ps2write device then it will do this for you, but you stand a great risk that in the process it will convert the vector glyph data into bitmaps, which will not scale well.

Fundamentally you can't really achieve what you want to do (convert 1 CIDFont into N regular Fonts) with Ghostscript, or in fact with any other tool that I know of. While it's technically possible, there is no real point since all PDF consumers should be able to handle CIDFonts. If they can't then it's a bug in the PDF consumer.

answered Oct 11 '22 at 09:10 by KenS


Right, I got a bit further on this (but not completely) - so I'll post a partial answer/comment here.

Essentially, this is not a problem about font embedding in PDF - this is a problem with font mapping.

To show that, let's analyse the mypg3out.pdf, which was extracted by gs in the OP (from the 3rd page of the fontspec.pdf document):

$ pdffonts mypg3out.pdf 
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
Error: Missing language pack for 'Adobe-Japan1' mapping
CAAAAA+Osaka-Mono-Identity-H         CID TrueType      yes yes yes     19  0
GBWBYF+CMMI9                         Type 1C           yes yes yes     28  0
FDFZUN+Skia-Regular_wght13333_wdth11999 TrueType          yes yes yes     16  0
ZRLTKK+Optima-Regular                TrueType          yes yes yes     30  0
ZFQZLD+FPLNeu-Bold                   Type 1C           yes yes yes      8  0
DDRFOG+FPLNeu-Italic                 Type 1C           yes yes no      22  0
HMZJAO+FPLNeu-Regular                Type 1C           yes yes yes     10  0
RDNKXT+FPLNeu-Regular                Type 1C           yes yes yes     32  0
GBWBYF+Skia-Regular_wght13333_wdth11999 TrueType          yes yes no      26  0

As the output shows - all fonts are, indeed, embedded; so something else is the problem. (It would have been more difficult to observe this in the complete fontspec.pdf, as there are a ton of fonts there, and a ton of error messages.)

The crucial point (I think) here, is that there is:

  • only one "Error: Missing language pack for 'Adobe-Japan1' mapping" message; and
  • only one CID TrueType font, which is CAAAAA+Osaka-Mono-Identity-H
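To pick out the CID-keyed fonts without reading the table by eye, the pdffonts listing can be filtered on its type column. A minimal sketch, assuming the column layout shown above (two header lines, font name in the first column); the function name `list_cid_fonts` is made up for illustration:

```shell
# Hypothetical helper: reads `pdffonts` output on stdin and prints the names
# of fonts whose type column starts with "CID" (e.g. "CID TrueType").
list_cid_fonts() {
    awk 'NR > 2 && / CID (TrueType|Type 0)/ { print $1 }'
}

# Example use (stderr dropped so error messages do not mix into the table):
#   pdffonts mypg3out.pdf 2>/dev/null | list_cid_fonts
```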

There seems to be an obvious relationship between the CID TrueType and the 'Adobe-Japan1' mapping error; and I got that finally clarified by CID fonts - How to use Ghostscript:

CID fonts are PostScript resources containing a large number of glyphs (e.g. glyphs for Far East languages, Chinese, Japanese and Korean). Please refer to the PostScript Language Reference, third edition, for details.

CID font resources are a different kind of PostScript resource from fonts. In particular, they cannot be used as regular fonts. CID font resources must first be combined with a CMap resource, which defines specific codes for glyphs, before it can be used as a font. This allows the reuse of a collection of glyphs with different encodings.

All good - except here we're dealing with PDF fonts, not PostScript fonts as such; let's demonstrate that a bit.

For instance, 5.3. Using Ghostscript To Preview Fonts - Making Fonts Available To Ghostscript - Font HowTo describes how the Ghostscript-installed script called prfont.ps can be used to render a table of fonts.

However, here it is easier to follow Listing Ghostscript Fonts [gs-devel]: the resourceforall operator lists all font resources, and the resourcestatus operator queries for a specific font - neither requires a special .ps script:

$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c 'currentpagedevice (*) {=} 100 string /Font resourceforall'
...
Processing pages 1 through 1.
Page 1
URWAntiquaT-RegularCondensed
Palatino-Italic
Hershey-Gothic-Italian
...

$ gs -o /dev/null -dNODISPLAY -f mypg3out.pdf \
-c '/TimesNewRoman findfont pop [/TimesNewRoman /Font resourcestatus]'
....
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.02/Resource/Font/TimesNewRomanPSMT.
Can't find (or can't open) font file TimesNewRomanPSMT.
Querying operating system for font files...
Loading TimesNewRomanPSMT font from /usr/share/fonts/truetype/msttcorefonts/times.ttf... 2549340 1142090 3496416 1237949 1 done.

We got a list of fonts; however, those are system fonts available to ghostscript - not the fonts embedded in the PDF!

(Basically,

  • gs -o /dev/null -dNODISPLAY -f mypg3out.pdf -c 'currentpagedevice (*) {=} 100 string /Font resourceforall' | grep -i osaka will return nothing, and
  • -c '/CAAAAA+Osaka-Mono-Identity-H findfont pop [/CAAAAA+Osaka-Mono-Identity-H /Font resourcestatus]' will conclude with "Didn't find this font on the system! Substituting font Courier for CAAAAA+Osaka-Mono-Identity-H.")

To list the fonts in the PDF, the pdf_info.ps script file from Ghostscript (not installed, in sources) can be used:

$ wget "http://git.ghostscript.com/?p=ghostpdl.git;a=blob_plain;f=gs/toolbin/pdf_info.ps" -O pdf_info.ps

$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsNeeded pdf_info.ps
...
No system fonts are needed.

$ gs -dNODISPLAY -q -sFile=mypg3out.pdf -dDumpFontsUsed -dShowEmbeddedFonts pdf_info.ps
...
Font or CIDFont resources used:
CAAAAA+Osaka-Mono
DDRFOG+FPLNeu-Italic
FDFZUN+Skia-Regular_wght13333_wdth11999
GBWBYF+CMMI9
GBWBYF+Skia-Regular_wght13333_wdth11999
GTIIKZ+Osaka-Mono
HMZJAO+FPLNeu-Regular
RDNKXT+FPLNeu-Regular
ZFQZLD+FPLNeu-Bold
ZRLTKK+Optima-Regular

So finally we can observe the CAAAAA+Osaka-Mono in Ghostscript - although I wouldn't know how to query more specific information about it from within ghostscript.
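Note that Osaka-Mono appears twice in that list, under two different prefixes. In PDF, the name of a subset font carries a tag of six uppercase letters followed by '+', so the two entries are separate subsets of the same base font. A small sketch that strips the tag to show the distinct base fonts; the function name `base_font_names` is made up for illustration:

```shell
# Hypothetical helper: strips a leading six-uppercase-letter subset tag
# ("ABCDEF+") from each font name read on stdin, then de-duplicates.
base_font_names() {
    sed 's/^[A-Z]\{6\}+//' | sort -u
}

# Example use, feeding it the font names from the listing above:
#   printf '%s\n' CAAAAA+Osaka-Mono GTIIKZ+Osaka-Mono | base_font_names
```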

 

In the end, I guess my question boils down to: how could ghostscript be used, to map the glyphs from a CID embedded font - into a font with a different "encoding" (or "character map"?), which will not require additional language files?

Addendum

I have also experimented with these approaches:

  • pdffonts on the output here will not have the Osaka-Mono listed, but it will still complain "Error: Missing language pack for 'Adobe-Japan1' mapping":
    $ wget http://whalepdfviewer.googlecode.com/svn/trunk/cmaps/japanese/Adobe-Japan1-UCS2
    $ gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH -f mypg3out.pdf Adobe-Japan1-UCS2
  • same as previously - this (via Ghostscript's "Use.htm") also makes Osaka-Mono disappear from pdffonts list:
    gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
    -c '/CIDSystemInfo << /Registry (Adobe) /Ordering (Unicode) /Supplement 1 >>' \
    -f mypg3out.pdf
  • this crashes with Error: /undefinedresource in findresource:
    gs -sDEVICE=pdfwrite -o mypg3o2.pdf -dBATCH \
    -c '/Osaka-Mono-Identity-H /H /CMap findresource [/Osaka-Mono-Identity /CIDFont findresource] == ' \
    -f mypg3out.pdf

Note finally that some of the .ps scripts ghostscript installs, it may use automatically; for instance, you can find gs_ttf.ps:

$ locate gs_ttf.ps
/usr/share/ghostscript/9.02/Resource/Init/gs_ttf.ps

... and then using sudo nano $(locate gs_ttf.ps), you can add the statement (Hello from gs_ttf.ps\n) print at the beginning of the code; then whenever one of the above gs commands is called, the printout will be visible on stdout.

References

  • Adding your own fonts - Fonts and font facilities supplied with Ghostscript
  • About "CIDFnmap" of Ghostscript - Features to support CJK CID-keyed in Ghostscript
  • Bug 689538 – GhostScript can not handle an embedded TrueType CID-Font
  • Bug 692589 – "Error CIDSystemInfo and CMap dict not compatible" when converting merged file to PDF/A - #1522
  • Adobe Forums: CMap resources versus PDF mapping resources:
    Please keep in mind that a CMap resource unidirectionally maps character codes to CIDs. Those other resources that Acrobat uses are best referred to as PDF mapping resources. Among them, there is a special category called ToUnicode mapping resources that unidirectionally map CIDs to UTF-16BE character codes
  • Adobe CIDs and glyphs in CJK TrueType font
  • Ghostscript and Japanese TrueType font
  • Installation guide: GS and CID font
  • Debian -- Filelist of package poppler-data/sid/all
answered Oct 11 '22 at 08:10 by sdaau