
Weird characters when filling PDF with PDFTk

I'm using PHP with pdftk on Ubuntu. When I fill a PDF with data, letters with accents (á, ó, í) come out as weird characters. The data is UTF-8: echo mb_check_encoding($var, 'UTF-8') outputs 1 (TRUE). Any idea what I can do?

I also tried converting to ISO-8859-1 with utf8_decode(), but still no luck.

Thanks

sergio asked Dec 16 '22 13:12


1 Answer

You're right, utf8_decode() will work for characters which can be encoded as ISO-8859-1 (Latin-1, i.e. U+0000–U+00FF), since that is the encoding it converts to.

However, it won't work for characters outside that range.

You can always encode characters using UTF-16BE, though. You can do this for a single field only, e.g. to encode the word "özil":

<<
/V (þÿ^@ö^@z^@i^@l)
/T (name)
>>

(Here the "^@" indicates a NUL character (U+0000). This is how it looks in my editor (vim), if the file is encoded in Windows-1252 (latin1).)

Note that you need to use a byte order mark (which will appear as "þÿ" if your file is encoded in Windows-1252) and you'll need to encode the entire string (between the two parentheses) in UTF-16.

If you're generating the FDF in a PHP script you can do something like this:

<<
/V (<?php echo chr(0xfe) . chr(0xff) . str_replace(array('\\', '(', ')'), array('\\\\', '\(', '\)'), mb_convert_encoding("özil", 'UTF-16BE', 'UTF-8')); ?>)
/T (name)
>>

You can also write out the hex codes like this (i.e. enclosed in angle brackets rather than parentheses):

<<
/V <FEFF00F6007A0069006C>
/T (name)
>>

This has exactly the same result (the string "özil"). It's less efficient in terms of characters, but it actually seems to be more reliable in pdftk, which has some bugs I've found (in version 2.02).
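If you're generating the FDF from PHP, the hex form can be produced in one step. The following is only a minimal sketch: the function names fdf_hex_string() and build_fdf() are made up for illustration, the input is assumed to be valid UTF-8, and field names are assumed to be plain ASCII.

```php
<?php
// Hypothetical helper: encode a UTF-8 string as a PDF hex string that
// holds UTF-16BE text preceded by the FE FF byte order mark.
function fdf_hex_string(string $utf8): string
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');
    return '<' . strtoupper(bin2hex("\xFE\xFF" . $utf16)) . '>';
}

// Hypothetical helper: build a minimal FDF document from name => value pairs.
// Assumption: field names are plain ASCII, so only \ ( ) are escaped in them.
function build_fdf(array $fields): string
{
    $out = "%FDF-1.2\n1 0 obj\n<< /FDF << /Fields [\n";
    foreach ($fields as $name => $value) {
        $name = str_replace(
            array('\\', '(', ')'),
            array('\\\\', '\\(', '\\)'),
            (string) $name
        );
        $out .= "<<\n/V " . fdf_hex_string((string) $value)
              . "\n/T (" . $name . ")\n>>\n";
    }
    return $out . "] >> >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n";
}

// "\u{00F6}" is ö written as an escape, so this file's own encoding does not matter.
echo fdf_hex_string("\u{00F6}zil"), "\n";
```

The result of build_fdf() could be written to a file (say data.fdf, a placeholder name) and passed to pdftk with something like: pdftk form.pdf fill_form data.fdf output filled.pdf. Because the hex form contains only ASCII, no escaping of the value is needed.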

Finally, you can also write out the code point of a character in octal notation (\ddd). For example, ö has code point U+00F6, which in octal is 366, so you can write:

<<
/V (\366zil)
/T (name)
>>

However, this only works up to U+00FF (octal 377). Beyond that, you'd have to use UTF-16.
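In PHP, the octal form could be generated roughly as follows. This is a sketch, not a tested recipe for pdftk: fdf_octal_string() is a made-up name, the input is assumed to be valid UTF-8, and as a conservative assumption it only accepts printable ASCII and U+00A1–U+00FF, returning null for anything else so that the caller can fall back to UTF-16.

```php
<?php
// Hypothetical helper: write a UTF-8 string as a PDF literal string,
// using \ddd octal escapes for non-ASCII characters and for \ ( ).
// Returns null if a character is outside the ranges handled here.
function fdf_octal_string(string $utf8): ?string
{
    $out = '';
    foreach (mb_str_split($utf8, 1, 'UTF-8') as $ch) {
        $cp = mb_ord($ch, 'UTF-8');
        $isAscii = ($cp >= 0x20 && $cp <= 0x7E);
        $isHigh  = ($cp >= 0xA1 && $cp <= 0xFF);
        if (!$isAscii && !$isHigh) {
            return null; // caller should use the UTF-16 form instead
        }
        if ($isHigh || $ch === '\\' || $ch === '(' || $ch === ')') {
            $out .= sprintf('\\%03o', $cp); // e.g. U+00F6 becomes \366
        } else {
            $out .= $ch;
        }
    }
    return '(' . $out . ')';
}

echo fdf_octal_string("\u{00F6}zil"), "\n";
```

Escaping the backslash and the parentheses as octal as well keeps the function to a single escaping rule. mb_str_split() requires PHP 7.4 or later.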

The PDF standard allows you to set the encoding to UTF-8 for the whole FDF document. I tried this and it didn't work with pdftk; in theory, however, it would be done like this:

%FDF-1.2
1 0 obj
<<
/Version /1.3
/Encoding /utf_8
/FDF

(You would presumably have to set the FDF version to 1.3 (or more) in the header too, according to the standard.)

You can also do this at the field level:

<<
/V (özil)
/T (name)
/Encoding /utf_8
>>

But as I said, I didn't manage to get any of this to work. pdftk just seems to ignore it.

user2829228 answered Dec 19 '22 02:12