Encoding of PDF text string

I am working on a parser for PDF (text extraction).

When a page's content stream is Flate-encoded (zlib compression), my code is able to decompress it, and the output (stream object) then looks something like this:

BT
56.8 721.3 Td 
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET

I am interested in the string array (operand of TJ).

The array seems to contain multiple hex-encoded strings, but the corresponding hex values do not make sense. Instead, the values form a sequence like 010203..., as if this were some sort of LZ77 compression.

  • Do PDFs have multiple levels of compression?
  • How can I get plain text from above string array?
duckduckgo asked Apr 06 '15 08:04



2 Answers

Before you start an ambitious project like this, you should make yourself familiar with the complete official PDF-1.7 specification. Be warned: this is a 756 page document, and it refers to about 90 other documents, which it declares to be also "normative" for PDF.

You will learn that in order to turn PDF source code back into text content, you have to reverse-apply the encoding used by the font. There are 5 spec-defined standard encodings which may be used:

  1. StandardEncoding
  2. MacRomanEncoding
  3. WinAnsiEncoding
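  4. PDFDocEncoding (strictly speaking, used for text strings outside content streams, such as bookmarks, rather than as a font encoding)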
  5. MacExpertEncoding

On top of that, there can also be a custom encoding (which comes into play when the embedded font is a subset, and does not contain all glyphs defined by the font, but only those glyphs required by the document). You can only reliably reverse custom-encoded text if a /ToUnicode table is defined for that font inside the PDF. Only then will you be able to map the character codes back to Unicode values.

You will also learn that there is not just one operator for showing text strings, but four:

  1. Tj : "Show text"
  2. TJ : "Show text, allowing individual glyph positioning"
  3. ' : "Move to next line and show text"
  4. " : "Set word and character spacing, move to next line, and show text"

Moreover, there are three different ways to represent text strings. Here given as examples for the string "string":

  1. (string) : This uses standard printable ASCII characters (only possible for Latin/ASCII text parts) inside parentheses.
  2. (\163\164\162\151\156\147) : This uses octal character codes (also inside parentheses), as listed in "Annex D (normative) Character Sets and Encodings" of the specification document.
  3. <737472696E67> : This uses hex-encoded character codes inside angle brackets.

The problems for the text extractor are the following:

  1. Using printable ASCII characters (1. above) and octal character codes (2. above) can be mixed. All of the following are also "legal" representations of the string "string" (listing not complete!):

     (\163tring)Tj
     (\163\164\162\151\156g) Tj
     (st\162i\156g)  Tj
     ...
    
  2. Using hex-encoded character codes (3. above) is also not straightforward, because whitespace inside the angle brackets is ignored, so all of the following representations are equivalent:

    <73 74 72 69 6E 67> Tj

    <73 7472 696E67> Tj

    <7 374 7 269 6E 67>Tj

    <73   74    72696E 67> Tj

    <73
      74 7
      2 69 6E 67>
    Tj
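
The two problems above can be handled by normalizing every string operand to raw bytes before doing anything else. The following is only a minimal sketch in Python: the function names decode_literal and decode_hex are made up for illustration, the input is assumed to be the text between the delimiters (without the outer parentheses or angle brackets), and the literal string is assumed to contain only characters in the Latin-1 range.

```python
import re

# Single-character escapes allowed inside a PDF literal string.
_ESCAPES = {"n": 0x0A, "r": 0x0D, "t": 0x09, "b": 0x08, "f": 0x0C,
            "(": 0x28, ")": 0x29, "\\": 0x5C}


def decode_literal(body: str) -> bytes:
    """Decode the text between the outer parentheses of a literal string.

    Assumption: `body` holds only characters up to U+00FF (e.g. the file
    was read as Latin-1); anything else would raise ValueError below.
    """
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch))
            i += 1
            continue
        i += 1  # skip the backslash
        if i >= len(body):
            break  # lone trailing backslash: nothing left to escape
        octal = re.match(r"[0-7]{1,3}", body[i:])
        if octal:
            # One to three octal digits give the character code.
            out.append(int(octal.group(), 8) & 0xFF)
            i += octal.end()
        elif body[i] in _ESCAPES:
            out.append(_ESCAPES[body[i]])
            i += 1
        elif body[i] in "\r\n":
            # Backslash at end of line: the line break is not part of the string.
            i += 2 if body[i:i + 2] == "\r\n" else 1
        else:
            # Unknown escape: the backslash is dropped, the character kept.
            out.append(ord(body[i]))
            i += 1
    return bytes(out)


def decode_hex(body: str) -> bytes:
    """Decode the text between the angle brackets of a hex string."""
    digits = "".join(body.split())  # whitespace between digits is ignored
    if len(digits) % 2:
        digits += "0"  # a missing final digit is taken to be 0
    return bytes.fromhex(digits)
```

With this, decode_literal(r"st\162i\156g") and decode_hex("7 374 7 269 6E 67") both yield the same six bytes, which is the point: the byte values are character codes for the current font, not yet text.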
    

For more weirdness allowed by the PDF spec (or tolerated by the Adobe viewers) see also for example:

  • PDF Tricks (by Ange Albertini of @corkami fame)

I recently created a small series of hand-coded PDF files which demonstrate how a missing, an incorrect, a manipulated or a correct /ToUnicode table influences the outcome of any PDF-to-text reversing:

  • Why text extracting doesn't work for all PDFs
    (This same repository contains some more study material in the form of hand-coded PDFs which highlight other parts and operators of the PDF syntax.)

Finally, looking at the small snippet of PDF source code the OP provided:

BT
56.8 721.3 Td 
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET

  • BT and ET mark the beginning and end of a text object.

  • 56.8 721.3 Td moves the text position 56.8 units to the right and 721.3 units up from the start of the current line (here the origin, because the text object has only just begun). In the default coordinate system one unit is one point.

  • /F2 12 Tf selects the font resource named /F2 at a size of 12 points. /F2 refers to a font that is defined elsewhere in the PDF document. That font also sets a font encoding (and possibly a /ToUnicode table). The font encoding determines which glyph shape is drawn when a specific character code is seen in the text strings.

  • [<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ

This last part can be dissected into these parts:

  • <01>2 : <01> is the first character code. 2 is a parameter for the "individual glyph positioning" allowed when using the text show operator TJ.
  • <0203>2 : <0203> are two more character codes. 2 again is a parameter for the "individual glyph positioning" for TJ.
  • <04>-10 : <04> is the fourth character code. -10 again for the "individual glyph positioning" with TJ.
  • <0503>2 : <05> is the fifth character code, <03> is the third character code (used before). 2 is for "individual glyph positioning"...
  • etc.
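
The dissection above can be sketched in Python as follows. This is an illustration only: the function name parse_tj_array is made up, the input is assumed to be the text between the square brackets, and only hex strings and numbers are handled (literal strings in parentheses are left out to keep the sketch short).

```python
import re

# Either a hex string <...> (group 1) or a number (group 2).
_TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|([-+]?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(array_body: str) -> list:
    """Split the operand of TJ into bytes objects and float adjustments."""
    items = []
    for match in _TOKEN.finditer(array_body):
        if match.group(1) is not None:
            digits = "".join(match.group(1).split())
            if len(digits) % 2:
                digits += "0"  # a missing final hex digit is taken to be 0
            items.append(bytes.fromhex(digits))
        else:
            items.append(float(match.group(2)))
    return items


if __name__ == "__main__":
    operand = "<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>"
    for item in parse_tj_array(operand):
        kind = "codes" if isinstance(item, bytes) else "adjust"
        print(kind, item.hex() if isinstance(item, bytes) else item)
```

Joining only the bytes items gives the character codes in reading order; the numbers in between matter for layout (and for guessing where word spaces are), not for the identity of the characters.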

Individual glyph positioning: The individual glyph positioning works like this:

  • Positive numbers shift the next glyph to the left (decreasing the space before the next glyph).
  • Negative numbers shift the next glyph to the right (adding more space before the next glyph).
  • The numbers are expressed in thousandths of a unit of text space, so their effect scales with the current font size.
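
As a worked example of that arithmetic for horizontal writing: the adjustment is divided by 1000, multiplied by the font size and the horizontal scaling, and subtracted from the current position. The function name tj_shift below is made up for this sketch, and horizontal scaling is assumed to be given as a factor (1.0 for the default of 100%).

```python
def tj_shift(adjustment: float, font_size: float,
             horizontal_scaling: float = 1.0) -> float:
    """Return the horizontal shift caused by one number in a TJ array.

    The result is in text space units; positive means "to the right".
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling


if __name__ == "__main__":
    # With the 12 point font from the snippet:
    print(tj_shift(-10, 12))  # small shift to the right
    print(tj_shift(2, 12))    # even smaller shift to the left
```

At 12 points the -10 in the snippet moves the next glyph roughly 0.12 units to the right, far too little to be a word space, so the numbers in this particular array look like kerning rather than spacing between words.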

Meaning of character codes: To know the meaning of the first, second, third, ... last character codes, you'll have to look them up in the /ToUnicode table that belongs to the font in your PDF. If no such table is embedded, then bad luck!
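
A /ToUnicode table is itself a (usually compressed) stream containing a CMap. A rough Python sketch of the lookup follows. Note the assumptions: the CMap text below is invented for illustration and is not the table from the OP's file, the function names are made up, only beginbfchar sections are handled (beginbfrange sections are ignored), and character codes are assumed to be one byte long.

```python
import re

# Invented example data; a real table comes from the font's /ToUnicode stream.
HYPOTHETICAL_CMAP = """
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
"""

_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> dict:
    """Map source codes (bytes) to Unicode strings from bfchar sections."""
    table = {}
    for section in _SECTION.findall(cmap_text):
        for src, dst in _PAIR.findall(section):
            # Destination values are UTF-16BE code units.
            table[bytes.fromhex(src)] = bytes.fromhex(dst).decode("utf-16-be")
    return table


def to_text(codes: bytes, table: dict, code_len: int = 1) -> str:
    """Translate character codes to text; unknown codes become U+FFFD."""
    chunks = (codes[i:i + code_len] for i in range(0, len(codes), code_len))
    return "".join(table.get(chunk, "\ufffd") for chunk in chunks)


if __name__ == "__main__":
    table = parse_bfchar(HYPOTHETICAL_CMAP)
    print(to_text(bytes.fromhex("0102030304"), table))
```

Codes like 01, 02, 03 in the question are typical of subset fonts, where the producer simply numbers the glyphs in order of first use; that is why they look like a counter rather than like text.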

Check easy extractability of text: To check if your PDF lends itself easily to text extraction, you could use the command line tool pdffonts. Here is an example output:

$ pdffonts sample.pdf
  name                      type          encoding     emb sub uni object ID
  ------------------------- ------------- ------------ --- --- --- ---------
  IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
  SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0

In the above example case, the subsetted font SSKFGJ+ArialMT uses a custom encoding, but the PDF has no /ToUnicode for this font, as indicated by the column headed uni. Hence it is not easy to extract text that is shown with this font (extraction would require manual reverse engineering -- but then you can also just "read" the PDF pages).
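
If many files need checking, the same test can be scripted. The Python sketch below parses the text output shown above and lists the fonts whose uni column says "no". It assumes exactly the column layout of this example (the layout may differ between pdffonts versions) and font names without spaces; the function name fonts_without_tounicode is made up.

```python
SAMPLE_OUTPUT = """\
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
"""


def fonts_without_tounicode(pdffonts_output: str) -> list:
    """Return the names of fonts whose 'uni' column is 'no'."""
    missing = []
    seen_rule = False
    for line in pdffonts_output.splitlines():
        if not seen_rule:
            # Skip everything up to and including the row of dashes.
            seen_rule = line.strip().startswith("---")
            continue
        # Split from the right, because the type column may contain spaces:
        # [name + type, encoding, emb, sub, uni, object number, generation]
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        if parts[4] == "no":
            missing.append(parts[0].split()[0])
    return missing


if __name__ == "__main__":
    print(fonts_without_tounicode(SAMPLE_OUTPUT))
```

For the sample output this reports only SSKFGJ+ArialMT, the font identified above as hard to extract.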

Kurt Pfeifle answered Oct 06 '22 00:10


Abhishek,

This is far from an easy question and unfortunately it shows you have not read the PDF specification. You should do so.

You can download the Acrobat SDK here: http://www.adobe.com/devnet/acrobat/sdk/eula.html

Part of that is the PDF Specification which is a very hefty document explaining the ins and outs of PDF (including the answer to your question).

In short, and not as a substitute for reading the documentation: what you're looking at are character codes in the encoding of the font selected by the /F2 12 Tf operator, which sets the font used for all text shown after it.

David van Driessche answered Oct 06 '22 00:10