
Ghostscript won't generate PDF/A with UTF16BE text string detected in DOCINFO - in spite of PDFACompatibilityPolicy saying otherwise

I am trying to convert normal PDF files to PDF/A with this command line:

gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf

However, I get the message

GPL Ghostscript 9.26: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, reverting to normal PDF output

and gs reverts to normal PDF. Apparently, the message stems from this code fragment of gs, but there we read that the message can occur only when pdev->PDFACompatibilityPolicy == 0. My understanding was that the parameter -sPDFACompatibilityPolicy=1 in the command line has the purpose of preventing this.

Q: Why does gs behave as if the desired policy were 0 instead of 1? Is there another way to set the policy to 1?

Also, out of curiosity:

Q: Is there a way to see what kind of strange DOCINFO is causing the original problem, or to prevent it in the first place? Using Acrobat Reader, I cannot see anything "suspicious" in the file. If it helps: the input.pdf is generated on Windows from Word (and I even tried the UseISO19005-1 setting, which should produce PDF/A to begin with, but the problem occurs anyway).

Hagen von Eitzen asked Dec 31 '22 19:12


2 Answers

You have put -sPDFACompatibilityPolicy=1. That, I'm afraid, is incorrect. Ghostscript has two kinds of switches: -s, which deals with string values, and -d, which deals with numeric and name values (names in PostScript begin with '/').

You've assigned a string value of '1' to the parameter PDFACompatibilityPolicy, which (internally) expects a numeric value. For reasons to do with the fact that these values are required to be accessible from the PostScript environment, we can't flag the type confusion as an error. Instead we leave the actual control at its default value of 0.

If you instead set -dPDFACompatibilityPolicy=1 I expect you will see the behaviour you expect.
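As a minimal sketch of the corrected invocation, the following Python helper builds the same argument list as the command in the question, changing only the policy switch from -s to -d. The function name build_pdfa_args is hypothetical and exists only for illustration; all other switches are copied unchanged from the question.

```python
def build_pdfa_args(input_path, output_path, policy=1):
    """Return the gs argument list from the question with a numeric policy switch.

    build_pdfa_args is a hypothetical helper name, not part of Ghostscript.
    """
    return [
        "gs",
        "-dPDFA",
        "-dBATCH",
        "-dNOPAUSE",
        "-sProcessColorModel=DeviceCMYK",
        "-sDEVICE=pdfwrite",
        # -d, not -s: the parameter expects a number, not the string '1'.
        f"-dPDFACompatibilityPolicy={int(policy)}",
        f"-sOutputFile={output_path}",
        input_path,
    ]
```

Assuming gs is on the PATH, the list can be passed directly to subprocess.run(build_pdfa_args("input.pdf", "output.pdf"), check=True).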

As for seeing the data, without looking at the PDF file I cannot tell. However, if you stop in the debugger at that point and look at p->data you will be able to see what the data is. If you look at pairs + i instead of pairs + i + 1 you will be able to see the key which is associated with the value from the DOCINFO pdfmark.

You won't be able to see anything 'suspicious' by looking at the file in Acrobat, because Acrobat will translate the UTF16BE into whatever your system requires in order to display the text correctly. The text may even be plain ASCII; that can still be represented as UTF16.

If you open the file in a text editor you may be able to see the relevant string (note that the BOM in Ghostscript is in octal, so that's 0xFE 0xFF in hexadecimal), provided it's not in a compressed object stream.
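Instead of searching by eye in a text editor, the raw bytes can be scanned for strings that start with the UTF16BE byte-order mark. The sketch below is an assumption-laden illustration, not a PDF parser: the name find_utf16be_strings is hypothetical, and it assumes the mark appears either as the octal escapes \376\377, as the raw bytes 0xFE 0xFF right after an opening parenthesis, or as a hex string beginning with <FEFF. Like the text-editor approach, it cannot see inside compressed object streams.

```python
import re

# Three ways a UTF-16BE byte-order mark can open a PDF string object
# (an assumption of this sketch; a real parser would tokenize the file).
BOM_PATTERNS = {
    "literal, octal escapes": re.compile(rb"\(\\376\\377"),
    "literal, raw bytes": re.compile(rb"\(\xfe\xff"),
    "hex string": re.compile(rb"<\s*[Ff][Ee]\s*[Ff][Ff]"),
}


def find_utf16be_strings(data: bytes, context: int = 40):
    """Return sorted (offset, kind, snippet) tuples for each BOM-prefixed string.

    The snippet includes `context` bytes on either side of the match, so the
    dictionary key (for example /Title) is usually visible before the string.
    """
    hits = []
    for kind, pattern in BOM_PATTERNS.items():
        for match in pattern.finditer(data):
            start = max(0, match.start() - context)
            hits.append((match.start(), kind, data[start:match.end() + context]))
    return sorted(hits)
```

To try it, read the file in binary mode, for example data = open("input.pdf", "rb").read(), and print each tuple returned by find_utf16be_strings(data).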

KenS answered Jan 09 '23 10:01


Examining the source of the latest Ghostscript (9.50), it seems that the PDFACompatibilityPolicy values in this case (see devices/vector/gdevpdfm.c around line 1951) set the error-handling behavior as follows:

  • 0 will revert to normal PDF output (not really what I wanted)
  • 1 will discard PDFINFO (even worse)
  • 2 will throw an error (even even worse)
  • any other value is ignored in the switch and works as a pass-through!

So, in my case, the whole thing was solved simply by setting

-dPDFACompatibilityPolicy=3

Ghostscript does not complain, does not abort PDF/A output, does not discard the PDFINFO, and, most importantly, veraPDF checker still verifies the PDF as perfectly okay.

I'm not commenting on how ugly this solution is, but it works just great. Since all other switch statements just assume compatibility policy 0 if anything above 2 gets passed in, this "shortcut" seems to be an unintended, but very useful bug.

exa answered Jan 09 '23 11:01