
Encoding of PDF text string


I am working on a parser for PDF (text extraction).

When a page needs to be Flate-decoded (zlib compression), my code is able to decompress the content streams, and I then get output (a stream object) like the following:

```
BT
56.8 721.3 Td
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET
```

I am interested in the string array (operand of TJ).

It seems that this array contains multiple hex-encoded strings, but the corresponding hex values do not make sense as text. Instead, they form a sequence like 010203..., which looks like some sort of LZ77 compression.

Do PDFs have multiple levels of compression?
How can I get plain text from the above string array?


asked Apr 06 '15 08:04

duckduckgo

2 Answers

Before you start an ambitious project like this, you should make yourself familiar with the complete official PDF-1.7 specification. Be warned: this is a 756-page document, and it refers to about 90 other documents, which it also declares to be "normative" for PDF.

You will learn that, in order to turn the PDF source code back into text content, you have to reverse-apply the encoding used by the font. There are 5 spec-defined standard encodings which may be used:

StandardEncoding
MacRomanEncoding
WinAnsiEncoding
PDFDocEncoding
MacExpertEncoding

On top of that, there can also be a custom encoding (which typically comes into play when the embedded font is a subset that contains only the glyphs required by the document, not all glyphs defined by the font). You can only reliably reverse custom-encoded text if there is a /ToUnicode table defined inside the PDF. Only then will you be able to map the character codes back to Unicode characters.

You will also learn that there is not just one operator that can be used to show text strings, but four:

Tj : "Show text"
TJ : "Show text, allowing individual glyph positioning"
' : "Move to next line and show text"
" : "Set word and character spacing, move to next line, and show text"

Moreover, there are three different ways to represent text strings. Here they are, using the string "string" as an example:

1. (string) : This uses standard printable ASCII characters (only possible for Latin/ASCII text parts) inside parentheses.
2. (\163\164\162\151\156\147) : This uses octal character codes (also inside parentheses), as listed in "Annex D (normative) Character Sets and Encodings" of the specification document.
3. <737472696E67> : This uses hex-encoded character codes inside angle brackets.

The problems for the text extractor are the following:

Using printable ASCII characters (1. above) and octal character codes (2. above) can be mixed. All of the following are also "legal" representations of the string "string" (listing not complete!):
```
 (\163tring)Tj
 (\163\164\162\151\156g) Tj
 (st\162i\156g)  Tj
 ...
```
Using hex-encoded character codes (3. above) is also not straightforward, because whitespace inside the angle brackets is ignored, so all of the following representations are equivalent:
```
<73 74 72 69 6E 67> Tj

<73 7472 696E67> Tj

<7 374 7 269 6E 67>Tj

<73 74 72696E 67> Tj

<73
 74 7
 2 69 6E 67>
Tj
```
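The equivalences listed above can be checked with a small decoder. The following is only a rough sketch in Python, not a complete PDF string parser: the function names `decode_hex_string` and `decode_literal_string` are made up for this illustration, the token is assumed to have been read as Latin-1 text, and nested unescaped parentheses are not handled.

```python
import re

# Single-character escapes allowed after a backslash in a literal string.
_ESCAPES = {"n": 10, "r": 13, "t": 9, "b": 8, "f": 12, "(": 40, ")": 41, "\\": 92}


def decode_hex_string(token: str) -> bytes:
    """Decode a hex string such as '<73 7472 696E67>' into raw bytes."""
    body = token.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError("not a hex string: %r" % token)
    digits = re.sub(r"\s+", "", body[1:-1])  # whitespace between digits is ignored
    if len(digits) % 2:                      # odd digit count: last digit is padded with 0
        digits += "0"
    return bytes.fromhex(digits)


def decode_literal_string(token: str) -> bytes:
    """Decode a literal string such as '(st\\162i\\156g)' into raw bytes."""
    body = token.strip()
    if not (body.startswith("(") and body.endswith(")")):
        raise ValueError("not a literal string: %r" % token)
    body = body[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ord(ch) & 0xFF)       # assumes the token was read as Latin-1
            i += 1
            continue
        i += 1                               # skip the backslash
        if i >= len(body):
            break
        ch = body[i]
        if ch in "01234567":
            j = i                            # one to three octal digits
            while j < len(body) and j - i < 3 and body[j] in "01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif ch in _ESCAPES:
            out.append(_ESCAPES[ch])
            i += 1
        elif ch in "\r\n":                   # backslash + end of line: line continuation
            i += 1
            if ch == "\r" and i < len(body) and body[i] == "\n":
                i += 1
        else:                                # unknown escape: the backslash is dropped
            out.append(ord(ch) & 0xFF)
            i += 1
    return bytes(out)
```

With these helpers, `decode_literal_string(r"(\163tring)")` and `decode_hex_string("<7 374 7 269 6E 67>")` both yield the same six bytes as `(string)`. Those bytes are still character codes, not text: their meaning depends on the font's encoding.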

For more weirdness allowed by the PDF spec (or tolerated by the Adobe viewers) see also for example:

PDF Tricks (by Ange Albertini of @corkami fame)

I myself have recently created a little series of hand-coded PDF files which demonstrate how a missing, an incorrect, a manipulated or a correct /ToUnicode table influences the outcome of any PDF-to-text reversing:

Why text extracting doesn't work for all PDFs
(This same repository contains some more study material in the form of hand-coded PDFs which highlight other parts and operators of the PDF syntax.)

Finally, looking at the small snippet of PDF source code the OP provided:

```
BT
56.8 721.3 Td
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET
```

BT and ET indicate the beginning and end of a text-showing section.
56.8 721.3 Td moves the text position to the coordinates "56.8 points in horizontal, 721.3 points in vertical direction" (Td offsets are relative to the start of the current line, which is the origin right after BT).
12 Tf sets the font size to 12 points.
/F2 sets the font to be used to one that is defined elsewhere in the PDF document. That font also sets a font encoding somewhere (and possibly a /ToUnicode table). The font encoding determines which glyph shape is drawn when a specific character code is seen in the text strings.
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ

This last part can be dissected into these parts:

<01>2 : <01> is the first character code. 2 is a parameter for the "individual glyph positioning" allowed when using the text show operator TJ.
<0203>2 : <0203> are two more character codes. 2 again is a parameter for the "individual glyph positioning" for TJ.
<04>-10 : <04> is the fourth character code. -10 again for the "individual glyph positioning" with TJ.
<0503>2 : <05> is the fifth character code, <03> is the third character code (used before). 2 is for "individual glyph positioning"...
etc.
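The dissection above can be reproduced mechanically. Here is a minimal sketch in Python, assuming the array contains only hex strings and numbers (as in the OP's snippet; literal `(...)` strings are not handled) and that every character code is a single byte. The function name `split_tj_array` is invented for this example.

```python
import re

# Matches either a hex string <...> or a (possibly signed, possibly fractional) number.
_TJ_ELEMENT = re.compile(r"<([0-9A-Fa-f\s]*)>|([+-]?(?:\d+\.?\d*|\.\d+))")


def split_tj_array(array_src: str) -> list:
    """Split a TJ operand into bytes objects (strings) and floats (adjustments)."""
    inner = array_src.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("not an array: %r" % array_src)
    elements = []
    for match in _TJ_ELEMENT.finditer(inner[1:-1]):
        hex_digits, number = match.groups()
        if hex_digits is not None:
            digits = re.sub(r"\s+", "", hex_digits)
            if len(digits) % 2:              # odd digit count: pad the last digit with 0
                digits += "0"
            elements.append(bytes.fromhex(digits))
        else:
            elements.append(float(number))
    return elements


elements = split_tj_array(
    "[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]"
)
# Joining only the string elements gives the sequence of character codes.
codes = b"".join(e for e in elements if isinstance(e, bytes))
```

For the OP's array this gives 15 elements (8 strings and 7 adjustments) and 14 single-byte character codes: 01 02 03 04 05 03 04 05 06 07 08 09 0A 0B. The repeated codes 03, 04 and 05 are simply glyphs that occur more than once in the text.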

Individual glyph positioning: The individual glyph positioning works like this:

Positive numbers shift the next glyph to the left (decreasing glyph spacing to next glyph).
Negative numbers shift the next glyph to the right (adding more space to next glyph).
The numbers are expressed in thousandths of a unit of text space; the resulting shift scales with the current font size.
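As a worked example of these rules, the shift caused by one TJ number can be computed as follows. This is a sketch that assumes horizontal writing and ignores character spacing and word spacing; the function name `tj_adjustment_to_offset` is made up for illustration.

```python
def tj_adjustment_to_offset(adjustment: float, font_size: float,
                            horizontal_scaling: float = 1.0) -> float:
    """Return the horizontal shift (in text space units) caused by one TJ number.

    The number is given in thousandths of a unit and is subtracted from the
    current position, so positive values move the next glyph to the left
    (negative result) and negative values move it to the right.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scaling


# Values taken from the OP's snippet, with the 12 pt font size set by "/F2 12 Tf".
shift_for_2 = tj_adjustment_to_offset(2, 12)        # about -0.024: slightly tighter
shift_for_minus_10 = tj_adjustment_to_offset(-10, 12)  # about +0.12: slightly wider
```

Shifts this small are kerning-level corrections. They do not add any characters to the text, so a text extractor can usually skip them.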

Meaning of character codes: To know the meaning of the first, second, third, ... last character codes, you'll have to look them up in the /ToUnicode table of the font in your PDF. If no such table is embedded, then bad luck!
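To give an idea of what such a lookup involves, here is a rough Python sketch that reads the `beginbfchar ... endbfchar` sections of an already decompressed /ToUnicode CMap and applies them to character codes. It makes several assumptions: `bfrange` sections are ignored, destination values are taken to be UTF-16BE, and codes are one byte long unless `code_len` says otherwise. The names `parse_bfchar`, `codes_to_text` and the sample CMap are invented for this illustration and are not taken from the OP's file.

```python
import re

_BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.S)
_BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_src: str) -> dict:
    """Map character codes to Unicode strings using the bfchar entries of a CMap."""
    mapping = {}
    for block in _BFCHAR_BLOCK.findall(cmap_src):
        for src, dst in _BFCHAR_PAIR.findall(block):
            # Destination values are assumed to be UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def codes_to_text(codes: bytes, mapping: dict, code_len: int = 1) -> str:
    """Translate character codes to text; unmapped codes become U+FFFD."""
    out = []
    for i in range(0, len(codes), code_len):
        code = int.from_bytes(codes[i:i + code_len], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical CMap fragment, made up for this example only.
SAMPLE_CMAP = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
"""

mapping = parse_bfchar(SAMPLE_CMAP)
text = codes_to_text(b"\x01\x02\x03\x04", mapping)  # code 04 has no entry
```

With this invented table, codes 01, 02 and 03 map to "H", "i" and "!", while the unmapped code 04 becomes the replacement character. That is exactly the situation an extractor is in when the table is missing or incomplete.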

Check easy extractability of text: To check if your PDF lends itself easily to text extraction, you could use the command line tool pdffonts. Here is an example output:

```
$ pdffonts sample.pdf
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
```

In the above example case, the subsetted font SSKFGJ+ArialMT uses a custom encoding, but the PDF has no /ToUnicode for this font, as indicated by the column headed uni. Hence it is not easy to extract text that is shown with this font (extraction would require manual reverse engineering -- but then you can also just "read" the PDF pages).
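If you want to run this check automatically, the table can be parsed by using the row of dashes as column boundaries. The sketch below is in Python and assumes the fixed-width layout shown in the example output above; other `pdffonts` versions may print different columns. The function names `parse_pdffonts` and `fonts_without_tounicode` are hypothetical.

```python
import re


def parse_pdffonts(output: str) -> list:
    """Parse pdffonts' fixed-width table into a list of dicts keyed by column name."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # The separator row consists only of dashes and spaces; the header precedes it.
    dash_idx = next(i for i, ln in enumerate(lines)
                    if set(ln.strip()) <= {"-", " "})
    spans = [m.span() for m in re.finditer(r"-+", lines[dash_idx])]
    spans[-1] = (spans[-1][0], None)         # let the last column run to end of line
    header = lines[dash_idx - 1]
    names = [header[a:b].strip() for a, b in spans]
    rows = []
    for ln in lines[dash_idx + 1:]:
        rows.append({name: ln[a:b].strip() for name, (a, b) in zip(names, spans)})
    return rows


def fonts_without_tounicode(rows: list) -> list:
    """Return the names of fonts whose 'uni' column is not 'yes'."""
    return [row["name"] for row in rows if row.get("uni") != "yes"]
```

The input string could come, for example, from `subprocess.run(["pdffonts", "sample.pdf"], capture_output=True, text=True).stdout`. For the example output above, `fonts_without_tounicode` would report only `SSKFGJ+ArialMT`.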


answered Oct 06 '22 00:10

Kurt Pfeifle

Abhishek,

This is far from an easy question and unfortunately it shows you have not read the PDF specification. You should do so.

You can download the Acrobat SDK here: http://www.adobe.com/devnet/acrobat/sdk/eula.html

Part of that is the PDF Specification which is a very hefty document explaining the ins and outs of PDF (including the answer to your question).

In short - and not as a substitute to reading the documentation - what you're looking at are character values in the encoding of the font set by the /F2 12 Tf command which sets a particular font used when writing text subsequently.

answered Oct 06 '22 00:10

David van Driessche
