PDFBox “special” characters in Helvetica

Tags: java, pdfbox

I'm using PDFBox 2.0.0-SNAPSHOT to build a PDF in Java. It works fine for very basic characters (e.g. [a-zA-Z0-9]), but I get encoding errors for slightly more advanced characters such as ’ (quoteright, U+2019). Here's my code:

PDDocument pdf = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
pdf.addPage(page);

PDPageContentStream contents = new PDPageContentStream(pdf, page);
PDFont font = PDType1Font.HELVETICA;
contents.beginText();
contents.setFont(font, 12);

// ...

String text = "’";
contents.showText(text);

contents.endText();
contents.close();

I get this exception:

Can't encode U+2019 in font Helvetica. Type 1 fonts only support 8-bit code points

I looked up the supported characters for non-embedded fonts in Section D.1 of the PDF specification, and this character should be supported.

Indeed, if I use this trick, I can insert the correct character:

// ...

// String text = "’";
// contents.showText(text);
byte[] commands = "(x) Tj ".getBytes();
commands[1] = (byte)146;    // = 222 octal = quoteright in WinAnsi (145 = 221 octal is quoteleft)
contents.appendRawCommands(commands);

// ...

But this isn't really a practical solution. Aside from the inconvenience of manually searching for every character that might be in the string, the appendRawCommands method is now deprecated.
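For checking which byte a given character maps to, a small stand-alone sketch can help. It assumes that the JDK's windows-1252 charset is an acceptable stand-in for the PDF WinAnsiEncoding table in Section D.1 (the two agree for the quote characters discussed here, but that is worth verifying for other characters). The class name `WinAnsiLookup` and the method `codeFor` are hypothetical helpers, not part of PDFBox. In that table, 145 (octal 221) is quoteleft and 146 (octal 222) is quoteright.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of PDFBox.
class WinAnsiLookup {
    // Assumption: windows-1252 approximates PDF WinAnsiEncoding.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Returns the 8-bit code (0-255) for c, or -1 if c has no mapping. */
    static int codeFor(char c) {
        if (!CP1252.newEncoder().canEncode(c)) {
            return -1;
        }
        return String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
    }

    public static void main(String[] args) {
        int right = codeFor('\u2019'); // right single quote
        int left = codeFor('\u2018');  // left single quote
        System.out.println(right + " = octal " + Integer.toOctalString(right)); // 146 = octal 222
        System.out.println(left + " = octal " + Integer.toOctalString(left));   // 145 = octal 221
    }
}
```

This only confirms the code values; it does not remove the need for a proper fix in the font's encode step.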

So, what's going on here? An earlier answer implies that showText should not have the problems of the old drawString method, but something clearly isn't working.

EDIT: As requested in the comments, here is the full stack trace of the exception:

Exception in thread "main" java.lang.IllegalArgumentException: Can't encode U+2019 in font Helvetica. Type 1 fonts only support 8-bit code points
    at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:343)
    at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:285)
    at org.apache.pdfbox.pdmodel.font.PDFont.getStringWidth(PDFont.java:314)
    at com.fatfractal.test.PDFBoxTest.textWidth(PDFBoxTest.java:148)
    at com.fatfractal.test.PDFBoxTest.showFlowingTextAt(PDFBoxTest.java:128)
    at com.fatfractal.test.PDFBoxTest.build(PDFBoxTest.java:73)
    at com.fatfractal.test.PDFBoxTest.main(PDFBoxTest.java:97)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

shawkinaw asked Nov 10 '15 18:11


2 Answers

Looking at the PDFBox code, it really seems like a bug. If you look at the PDType1Font.encode() method, it automatically throws if the code point is > 0xFF. However, if the logic instead proceeded in this case, the GlyphList would convert the "\u2019" character to "quoteright", which would then be a valid character in the font.
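The lookup described above can be sketched as two table steps: Unicode code point to glyph name, then glyph name to 8-bit code. The following is only an illustration under stated assumptions: the class `GlyphNameEncodeSketch` and its two maps are hypothetical, hand-written excerpts of the Adobe Glyph List and WinAnsiEncoding, not PDFBox's actual classes or tables.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: tiny hand-written excerpts, not PDFBox's real tables.
class GlyphNameEncodeSketch {
    // Excerpt of the glyph list direction "Unicode code point -> glyph name".
    static final Map<Integer, String> UNICODE_TO_NAME = new HashMap<>();
    // Excerpt of the encoding direction "glyph name -> 8-bit code" (WinAnsi values).
    static final Map<String, Integer> NAME_TO_CODE = new HashMap<>();

    static {
        UNICODE_TO_NAME.put(0x0041, "A");
        UNICODE_TO_NAME.put(0x2018, "quoteleft");
        UNICODE_TO_NAME.put(0x2019, "quoteright");
        UNICODE_TO_NAME.put(0x20AC, "Euro");
        NAME_TO_CODE.put("A", 0x41);
        NAME_TO_CODE.put("quoteleft", 0x91);
        NAME_TO_CODE.put("quoteright", 0x92);
        NAME_TO_CODE.put("Euro", 0x80);
    }

    /** Encodes via the glyph name instead of rejecting every code point above 0xFF. */
    static int encode(int codePoint) {
        String name = UNICODE_TO_NAME.get(codePoint);
        Integer code = (name == null) ? null : NAME_TO_CODE.get(name);
        if (code == null) {
            throw new IllegalArgumentException(
                    String.format("Can't encode U+%04X", codePoint));
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(encode(0x2019)); // 146, although U+2019 is above 0xFF
    }
}
```

The point of the sketch is that the rejection should happen only when no glyph name (or no code for that name) exists, not as soon as the code point exceeds 0xFF.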

jtahlborn answered Nov 12 '22 04:11


As @jtahlborn explained in his answer, PDType1Font.encode() is broken in the current 2.0.0 release candidate.

In contrast to the 1.x.x PDPageContentStream method drawString, though, the 2.0.0 release candidate method showText is encoding aware.

As a work-around, therefore, you could use a composite font with subset embedding instead, e.g. on a standard MS Windows installation:

InputStream fontStream = new FileInputStream("c:/Windows/Fonts/ARIALUNI.TTF");
PDType0Font font = PDType0Font.load(pdf, fontStream);

Using this font, your code will not fail for "’" because the composite font classes do not have the bug observed in PDType1Font.

mkl answered Nov 12 '22 03:11