Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with Unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in parentheses:

(something)

There is also the option to escape a character with an octal code:

(\251)

but an octal escape encodes a single byte, so this only goes up to \377 (255). How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
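For reference, here is a minimal sketch of the literal-string escaping described above, assuming the input is already a sequence of single bytes. The class and method names (`PdfLiteralString`, `escape`) are made up for illustration and do not come from any library. Each octal escape covers exactly one byte (\000 to \377), which is why this form alone cannot express characters beyond a single-byte encoding.

```java
// Hypothetical helper for illustration only; not part of gnujpdf or iText.
final class PdfLiteralString {
    private PdfLiteralString() {}

    /** Wraps raw bytes in parentheses, escaping delimiters and non-printable bytes. */
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder("(");
        for (byte raw : bytes) {
            int b = raw & 0xFF; // treat the byte as unsigned, 0..255
            if (b == '(' || b == ')' || b == '\\') {
                // Delimiters and the backslash itself are backslash-escaped.
                sb.append('\\').append((char) b);
            } else if (b < 0x20 || b > 0x7E) {
                // Anything outside printable ASCII becomes a three-digit octal escape.
                sb.append('\\').append(String.format("%03o", b));
            } else {
                sb.append((char) b);
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(new byte[] { 'a', '(', 'b', ')' }));
        System.out.println(escape(new byte[] { (byte) 0xA9 }));
    }
}
```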

Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (in which I've fixed several bugs, since the original author appears to have gone AWOL). It lets you program against an AWT Graphics interface, and ideally any replacement should do the same.

The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in layout.

Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle Unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.

Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

288

asked Sep 24 '08 16:09

Marcus Downing

1 Answer

Chapter 3 of the PDF Reference says this about Unicode:

Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography). For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase).
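As a rough illustration of the rule quoted above, here is a minimal Java sketch that writes a string as a hex-encoded PDF text string: the bytes 254 and 255 (FE FF) first, then the UTF-16BE bytes. The class and method names (`PdfTextString`, `toHexString`) are invented for this example. Note that the quoted passage covers text strings only. Text drawn on a page with a font operator depends on that font's encoding, which this sketch does not address.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any PDF library.
final class PdfTextString {
    private PdfTextString() {}

    /** Returns the text as a PDF hex string: BOM FE FF, then UTF-16BE bytes. */
    static String toHexString(String text) {
        // StandardCharsets.UTF_16BE does not emit a BOM, so we add FE FF ourselves.
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        StringBuilder sb = new StringBuilder(2 * utf16.length + 6);
        sb.append("<FEFF");
        for (byte b : utf16) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexString("A"));            // <FEFF0041>
        System.out.println(toHexString("\u6F22\u5B57")); // kanji: <FEFF6F225B57>
    }
}
```

Characters outside the Basic Multilingual Plane come out as surrogate pairs, because Java strings are already UTF-16 internally.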

112

answered Sep 17 '22 14:09

plinth

Tags: pdf, pdf-generation, unicode, utf-8