
Using PDFBox to write UTF-8 encoded strings to a PDF [duplicate]

I am having trouble writing Unicode characters to a PDF using PDFBox. The sample code below produces garbage characters instead of "š". What can I add to get support for UTF-8 strings?

PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);

PDType1Font font = PDType1Font.HELVETICA;
contentStream.setFont(font, 12);
contentStream.beginText();
contentStream.moveTextPositionByAmount(100, 400);
contentStream.drawString("š");
contentStream.endText();
contentStream.close();
document.save("test.pdf");
document.close();
Lucas Moellers asked Mar 24 '11 20:03





1 Answer

You are using one of the built-in 'Base 14' fonts that are supplied with Adobe Reader. These fonts are not Unicode; they cover essentially a standard Latin alphabet plus a few extra characters. According to the Latin character set table in Appendix D of the PDF specification (http://www.adobe.com/devnet/pdf/pdf_reference.html), š and Š are missing from the Standard and Mac Roman encodings; among the font encodings they are listed only for the Windows one (WinAnsiEncoding). So whether the character comes out correctly depends on which encoding is applied to the font, and you cannot rely on it being there.

Anyway, getting to the point: you need to embed a Unicode font if you want to use Unicode characters. Make sure you are licensed to embed whatever font you choose. I can recommend the open-source Gentium or Doulos fonts because they're free, high quality, and have comprehensive Unicode support.
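As a rough illustration of embedding a font, here is a minimal sketch that assumes PDFBox 2.x or later, which is newer than the API used in the question (there, `drawString` and `moveTextPositionByAmount` are replaced by `showText` and `newLineAtOffset`). The class name `UnicodeTextPdf`, the `write` helper, and the font path are made up for the example; point it at whichever TrueType file you are licensed to embed.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

public class UnicodeTextPdf {

    /** Writes text to a one-page PDF using a TrueType font embedded from fontFile. */
    static void write(File fontFile, String text, File output) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            // Load the TrueType file as a composite (Type 0) font and embed it,
            // instead of relying on PDType1Font.HELVETICA.
            PDFont font = PDType0Font.load(document, fontFile);

            // The content stream must be closed before the document is saved.
            try (PDPageContentStream contentStream =
                     new PDPageContentStream(document, page)) {
                contentStream.beginText();
                contentStream.setFont(font, 12);
                contentStream.newLineAtOffset(100, 400);
                contentStream.showText(text);
                contentStream.endText();
            }

            document.save(output);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical font path: replace with a real .ttf file on your system.
        write(new File("fonts/GentiumPlus-Regular.ttf"), "š", new File("test.pdf"));
    }
}
```

If the chosen font has no glyph for a character in the string, `showText` is expected to fail with an exception rather than write garbage, so it is worth testing with the actual text you need.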

gutch answered Sep 19 '22 16:09
