I understand that it is impossible to determine the character encoding of any stringform data just by looking at the data. This is not my question.
My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g.: UTF-8)? This would be something roughly analogous to <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
in HTML.
Thank you very much in advance, Blz
PDF character encoding determines the character set that is used to create PDF files. You can choose to use Windows1252 encoding, the standard Microsoft Windows operating system single-byte encoding for Latin text in Western writing systems, or unicode (UTF-16) encoding.
Open up your file using regular old vanilla Notepad that comes with Windows. It will show you the encoding of the file when you click "Save As...". Whatever the default-selected encoding is, that is what your current encoding is for the file.
A quick look at the PDF specification seems to suggest that you can have different encoding inside a PDF-file. Have a look at page 86. So a PDF library with some kind of low level access should be able to provide you with encoding used for a string. But if you just want the text and don't care about the internal encodings used I would suggest to let the library take care of conversions for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With