
Is there a field in which PDF files specify their encoding?

Tags: pdf, unicode, utf

I understand that it is impossible to reliably determine the character encoding of string data just by looking at the data. That is not my question.

My question is: is there a field in a PDF file where, by convention, the encoding scheme (e.g. UTF-8) is specified? This would be roughly analogous to the <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> element inside <head> in HTML.

Thank you very much in advance, Blz

Louis Thibault asked May 18 '12 16:05



1 Answer

A quick look at the PDF specification suggests that a single PDF file can contain strings in different encodings, so there does not appear to be one file-wide encoding field. Have a look at page 86. A PDF library with some kind of low-level access should therefore be able to tell you the encoding used for a particular string. But if you just want the text and don't care about the internal encodings, I would suggest letting the library take care of the conversions for you.
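To illustrate what "encoding per string" can look like in practice, here is a minimal Python sketch for one specific case: PDF text string objects (the kind used for metadata such as the document title, bookmarks and annotations). In the PDF specification these are either PDFDocEncoding or UTF-16BE marked by a leading FE FF byte-order mark; PDF 2.0 additionally allows UTF-8 marked by EF BB BF. The function names are hypothetical, and the Latin-1 fallback is only an approximation, since Python has no built-in PDFDocEncoding codec. Text drawn on a page is a different matter: it is interpreted through each font's encoding or CMap, which this sketch does not cover.

```python
import codecs


def detect_text_string_encoding(raw: bytes) -> str:
    """Guess the encoding of a PDF text string object from its leading bytes."""
    if raw.startswith(codecs.BOM_UTF16_BE):  # FE FF: UTF-16BE
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF8):      # EF BB BF: UTF-8 (PDF 2.0 only)
        return "utf-8"
    return "pdfdocencoding"                  # no marker: PDFDocEncoding


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string object to a Python str (hypothetical helper)."""
    encoding = detect_text_string_encoding(raw)
    if encoding == "utf-16-be":
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if encoding == "utf-8":
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # Assumption: Latin-1 stands in for PDFDocEncoding. The two agree for
    # ASCII and most accented Latin letters but differ in some code points.
    return raw.decode("latin-1")


print(decode_text_string(b"\xfe\xff\x00H\x00i"))
print(decode_text_string(b"Plain title"))
```

The raw bytes would come from whatever low-level accessor your PDF library exposes; most libraries already perform this step for you when you ask for metadata as text.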

Mattias Wadman answered Sep 19 '22 21:09