I am trying to convert normal PDF files to PDF/A with this command line:
gs -dPDFA -dBATCH -dNOPAUSE -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output.pdf input.pdf
However, I get the message
GPL Ghostscript 9.26: UTF16BE text string detected in DOCINFO cannot be represented in XMP for PDF/A1, reverting to normal PDF output
and gs reverts to normal PDF.
Apparently, the message stems from this code fragment of gs, but there we read that the message can occur only when pdev->PDFACompatibilityPolicy == 0. My understanding was that the parameter -sPDFACompatibilityPolicy=1 in the command line has the purpose of preventing this.
Q: Why does gs behave as if the desired policy were 0 instead of 1? Is there another way to set the policy to 1?
Also, out of curiosity:
Q: Is there a way to see what kind of strange DOCINFO is causing the original problem, or to prevent it in the first place? Using Acrobat Reader, I cannot see anything "suspicious" in the file. If it helps: the input.pdf is generated on Windows from Word (and I even tried with the UseISO19005-1 setting, which should produce PDF/A to begin with, but the problem occurs anyway).
You have put -sPDFACompatibilityPolicy=1. That, I'm afraid, is incorrect. Ghostscript has two kinds of switches: -s, which deals with string values, and -d, which deals with numeric and name values (names in PostScript begin with '/').
You've assigned a string value of '1' to the parameter PDFACompatibilityPolicy, which (internally) expects a numeric value. Because these values must be accessible from the PostScript environment, we can't flag the type confusion as an error. Instead we leave the actual control at its default value of 0.
If you instead set -dPDFACompatibilityPolicy=1 I expect you will see the behaviour you expect.
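A minimal sketch of how the corrected command line could be assembled from a script, assuming Python 3 and a gs executable on the PATH. The helper names build_gs_command and convert_to_pdfa are hypothetical; the switches are the ones from the question, with only the policy switch changed from -s to -d.

```python
import subprocess


def build_gs_command(input_pdf, output_pdf, policy=1):
    """Assemble the gs argument list from the question (hypothetical helper)."""
    return [
        "gs",
        "-dPDFA",
        "-dBATCH",
        "-dNOPAUSE",
        "-sProcessColorModel=DeviceCMYK",  # string value, so -s
        "-sDEVICE=pdfwrite",               # string value, so -s
        f"-dPDFACompatibilityPolicy={int(policy)}",  # numeric value, so -d
        f"-sOutputFile={output_pdf}",
        input_pdf,
    ]


def convert_to_pdfa(input_pdf, output_pdf, policy=1):
    """Run gs with the assembled arguments; raises if gs exits non-zero."""
    # Assumes gs is installed and on the PATH.
    subprocess.run(build_gs_command(input_pdf, output_pdf, policy), check=True)
```

Calling convert_to_pdfa("input.pdf", "output.pdf") would then run the same conversion as the command in the question, but with the policy passed as a number.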
As for seeing the data, I cannot tell without looking at the PDF file. However, if you stop in the debugger at that point and look at p->data, you will be able to see what the data is. If you look at pairs + i instead of pairs + i + 1, you will be able to see the key which is associated with the value from the DOCINFO pdfmark.
You won't be able to see anything 'suspicious' by looking at the file in Acrobat, because Acrobat will translate the UTF16BE into whatever your system requires in order to display the text correctly. The text may even be plain ASCII, since that can still be represented as UTF16.
If you open the file in a text editor you may be able to see the relevant string (note that the BOM in the Ghostscript source is written in octal, \376\377, which is 0xFE 0xFF in hexadecimal), provided it's not in a compressed object stream.
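Instead of a text editor, a short script can look for the BOM in the raw bytes. This is only a sketch, assuming Python 3; the function name find_utf16be_strings and the three patterns are illustrative, not anything Ghostscript provides. It searches for the three spellings a PDF string can use for a leading BOM (raw bytes, octal escapes, or a hex string) and, as noted above, it will miss anything inside a compressed object stream.

```python
import re

# Three ways a PDF can spell a string that starts with the UTF-16BE BOM.
BOM_PATTERNS = {
    "literal, raw bytes": re.compile(rb"\(\xfe\xff"),
    "literal, octal escapes": re.compile(rb"\(\\376\\377"),
    "hex string": re.compile(rb"<\s*[Ff][Ee][Ff][Ff]"),
}


def find_utf16be_strings(data):
    """Return sorted (offset, kind) pairs for each BOM-prefixed string start."""
    hits = []
    for kind, pattern in BOM_PATTERNS.items():
        hits.extend((m.start(), kind) for m in pattern.finditer(data))
    return sorted(hits)
```

To use it, read the file in binary mode, for example with open("input.pdf", "rb"), pass the bytes to find_utf16be_strings, and inspect the file around each reported offset to see which DOCINFO key the string belongs to.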
Examining the source of the latest Ghostscript (9.50), it seems that the PDFACompatibilityPolicy values in this case (see devices/vector/gdevpdfm.c around line 1951) set the error-handling behavior as such:
So, in my case, the whole thing was solved simply by setting
-dPDFACompatibilityPolicy=3
Ghostscript does not complain, does not abort PDF/A output, does not discard the DOCINFO, and, most importantly, the veraPDF checker still verifies the PDF as perfectly okay.
I'm not commenting on how ugly this solution is, but it works just great. Since all other switch statements just assume compatibility policy 0 if anything above 2 gets passed in, this "shortcut" seems to be an unintended, but very useful, bug.