Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with Unicode characters, such as kanji or unusual math symbols. To write a normal string in PDF, you enclose it in parentheses:

(something) 

There is also the option to escape a character with octal codes:

(\527) 

but three octal digits only cover 512 values. How do you encode or escape characters beyond that? I've seen references to byte streams and hex-encoded strings, but none of the references I've read explains how to actually do it.


Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.

The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in layout.


Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle Unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.


Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

Marcus Downing asked Sep 24 '08 16:09



1 Answer

Chapter 3 of the PDF Reference says this about Unicode:

Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography). For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase).
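
As a rough illustration of the rule quoted above, here is a minimal Java sketch that writes a Unicode text string as a PDF hex string: the bytes 254 and 255 (FE FF) first, then the text in UTF-16BE, all between angle brackets. The class and method names (PdfTextString, toHexTextString) are made up for this example and do not come from any library. Note the assumption that the string is a PDF "text string" such as a document title or bookmark label; strings drawn on the page with text-showing operators are interpreted through the font's encoding instead, so this alone will not make kanji appear in page content.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of gnujpdf or iText.
final class PdfTextString {

    /** Returns the text as a PDF hex string: <FEFF...>, UTF-16BE after the byte order marker. */
    public static String toHexTextString(String text) {
        // StandardCharsets.UTF_16BE does not emit a BOM itself, so it is added by hand below.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder("<FEFF");
        for (byte b : utf16) {
            // Mask to 0..255 so negative bytes do not print as FFFFFFxx.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        // The two kanji U+6F22 U+5B57
        System.out.println(toHexTextString("\u6F22\u5B57")); // prints <FEFF6F225B57>
    }
}
```

Characters outside the Basic Multilingual Plane need no special handling here, because Java strings already hold them as UTF-16 surrogate pairs and the encoder writes both halves.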

plinth answered Sep 17 '22 14:09