I am working on a parser for PDF (text extraction).
When a page's content stream is Flate-encoded (zlib compression), my code is able to decompress it, and the output (stream object) then looks something like this:
BT
56.8 721.3 Td
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET
I am interested in the array of strings (the operand of TJ).
It seems that the array contains multiple hex-encoded strings, but the corresponding hex values do not make sense as text. Instead, they appear to be a sequence like 010203..., as if some sort of LZ77 compression were in use.
PDF files are either 8-bit binary files or 7-bit ASCII text files (using ASCII-85 encoding). Every line in a PDF can contain up to 255 characters. Every line ends with a carriage return, a line feed or a carriage return followed by a line feed (depending upon the application or platform used to create the PDF file).
PDF character encoding determines the character set that is used to create PDF files. You can choose to use Windows-1252 encoding, the standard Microsoft Windows single-byte encoding for Latin text in Western writing systems, or Unicode (UTF-16) encoding.
Thus, a PDF never is UTF-8 encoded.
Before you start an ambitious project like this, you should make yourself familiar with the complete official PDF-1.7 specification. Be warned: this is a 756 page document, and it refers to about 90 other documents, which it declares to be also "normative" for PDF.
You will learn that, in order to turn the PDF source code back into text content, you have to reverse-apply the encoding used by the font. There are five spec-defined standard encodings which may be used:
StandardEncoding
MacRomanEncoding
WinAnsiEncoding
PDFDocEncoding
MacExpertEncoding
On top of that, there can also be a CustomEncoding (which comes into play when the embedded font is a subset and does not contain all glyphs defined by the font, but only those glyphs required by the document). You can only reverse custom-encoded text if there is a /ToUnicode table defined inside the PDF. Only then will you be able to map the encoded character codes back to Unicode characters.
You will also learn that there is not just one operator that can be used to show text strings, but four:
Tj: "Show text"
TJ: "Show text, allowing individual glyph positioning"
': "Move to next line and show text"
": "Set word and character spacing, move to next line, and show text"
Moreover, there are three different ways to represent text strings. Here they are given as examples for the string "string":
1. (string): This uses standard printable ASCII characters (only possible for Latin/ASCII text parts) inside parentheses.
2. (\163\164\162\151\156\147): This uses octal character codes (also inside parentheses), as listed in "Annex D (normative) Character Sets and Encodings" of the specification document.
3. <737472696E67>: This uses hex-encoded character codes inside angle brackets.
The problems for the text extractor are the following:
Printable ASCII characters (1. above) and octal character codes (2. above) can be mixed. All of the following are also "legal" representations of the string "string" (the list is not complete!):
(\163tring)Tj
(\163\164\162\151\156g) Tj
(st\162i\156g) Tj
...
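To make the mixing of plain characters and octal escapes concrete, here is a minimal Python sketch of a decoder for the body of a literal string (the part between the outer parentheses). The function name `decode_literal_string` is my own, not something from a PDF library, and the sketch assumes the caller has already located the matching closing parenthesis. It does not normalise unescaped line endings inside the string.

```python
def decode_literal_string(body: bytes) -> bytes:
    """Decode the body of a PDF literal string into raw character codes."""
    # Single-character escapes defined for literal strings.
    simple = {
        ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
        ord("b"): 0x08, ord("f"): 0x0C,
        ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
    }
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != 0x5C:              # ordinary byte, copy as-is
            out.append(byte)
            i += 1
            continue
        i += 1                        # skip the backslash
        if i >= len(body):
            break                     # dangling backslash at the end: drop it
        nxt = body[i]
        if 0x30 <= nxt <= 0x37:       # one to three octal digits
            value = 0
            digits = 0
            while digits < 3 and i < len(body) and 0x30 <= body[i] <= 0x37:
                value = value * 8 + (body[i] - 0x30)
                i += 1
                digits += 1
            out.append(value & 0xFF)
        elif nxt in simple:
            out.append(simple[nxt])
            i += 1
        elif nxt in (0x0A, 0x0D):     # backslash + end of line: line continuation
            i += 1
            if nxt == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                         # unknown escape: the backslash is ignored
            out.append(nxt)
            i += 1
    return bytes(out)


for variant in (rb"string", rb"\163tring", rb"\163\164\162\151\156g", rb"st\162i\156g"):
    print(variant, "->", decode_literal_string(variant))
```

Keep in mind that the result is still a sequence of character codes, not text: the font's encoding decides what those codes mean.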
Using hex-encoded character codes (3. above) is also not straightforward, because all of the following representations are equivalent:
<73 74 72 69 6E 67> TJ
<73 7472 696E67> TJ
<7 374 7 269 6E 67>TJ
<73 74 72696E 67> TJ
<73
74 7
2 69 6E 67>
TJ
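The whitespace variants above all collapse to the same bytes once white space is dropped before pairing up the hex digits. A minimal Python sketch follows; the helper name `decode_hex_string` is hypothetical. It also pads an odd number of digits with a trailing 0, which is how the specification says a missing final digit is to be treated.

```python
def decode_hex_string(body: str) -> bytes:
    """Decode the text between '<' and '>' of a PDF hex string."""
    digits = "".join(body.split())   # white space inside the brackets is ignored
    if len(digits) % 2:
        digits += "0"                # a missing final digit is assumed to be 0
    return bytes.fromhex(digits)


for variant in ("73 74 72 69 6E 67", "73 7472 696E67", "7 374 7 269 6E 67",
                "73\n74 7\n2 69 6E 67"):
    print(repr(variant), "->", decode_hex_string(variant))
```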
For more weirdness allowed by the PDF spec (or tolerated by the Adobe viewers) see also for example:
I myself have recently created a little series of hand-coded PDF files which demonstrate how a missing, an incorrect, a manipulated or a correct /ToUnicode table influences the outcome of any PDF-to-text reversing:
Finally, looking at the small snippet of PDF source code the OP provided:
BT
56.8 721.3 Td
/F2 12 Tf
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
ET
BT and ET indicate the beginning and end of a text showing section.
56.8 721.3 Td positions the current point at the coordinates "56.8 points in horizontal, 721.3 points in vertical direction".
12 Tf sets the font size to 12 points.
/F2 sets the font to be used to one that is defined elsewhere in the PDF document. That font also sets, somewhere, a font encoding (and possibly a /ToUnicode table). The font encoding will determine which glyph shape should be drawn when a specific character code is seen in the text strings.
[<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>]TJ
This last part can be dissected into these parts:
<01>2: <01> is the first character code. 2 is a parameter for the "individual glyph positioning" allowed when using the text show operator TJ.
<0203>2: <0203> are two more character codes. 2 again is a parameter for the "individual glyph positioning" for TJ.
<04>-10: <04> is the fourth character code. -10 again is for the "individual glyph positioning" with TJ.
<0503>2: <05> is the fifth character code, <03> is the third character code (used before). 2 is for "individual glyph positioning"...
Individual glyph positioning: Each number in the array is expressed in thousandths of a unit of text space and is subtracted from the current horizontal (or, in vertical writing mode, vertical) coordinate before the next string is shown. A positive number therefore moves the next glyph closer to the previous one, and a negative number moves it further away.
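As a rough illustration of this dissection, the following Python sketch splits the body of the TJ array from the snippet into byte strings and numeric adjustments. The names `parse_tj_array` and `adjustment_in_text_space` are my own. The sketch assumes the array contains only hex strings and numbers (literal strings in parentheses are not handled), and it relies on the rule from the specification that each number is in thousandths of a unit of text space and is subtracted from the current position, scaled by the font size and the horizontal scaling.

```python
import re

# A hex string in angle brackets, or a (possibly signed, possibly fractional) number.
TOKEN = re.compile(r"<([0-9A-Fa-f\s]*)>|([-+]?(?:\d+\.?\d*|\.\d+))")


def parse_tj_array(array_body: str) -> list:
    """Split the body of a TJ array into bytes (character codes) and floats."""
    items = []
    for match in TOKEN.finditer(array_body):
        if match.group(1) is not None:
            digits = "".join(match.group(1).split())
            if len(digits) % 2:
                digits += "0"
            items.append(bytes.fromhex(digits))
        else:
            items.append(float(match.group(2)))
    return items


def adjustment_in_text_space(adjustment: float, font_size: float,
                             horizontal_scale: float = 1.0) -> float:
    """Displacement caused by one TJ number, in text space units."""
    return -adjustment / 1000.0 * font_size * horizontal_scale


items = parse_tj_array("<01>2<0203>2<04>-10<0503>2<04>-2<0506070809>2<0A>1<0B>")
for item in items:
    if isinstance(item, bytes):
        print("codes:", item.hex())
    else:
        print("shift:", adjustment_in_text_space(item, font_size=12))
```

With the 12 point font from the snippet, the -10 entry widens the gap by about 0.12 units, while the 2 entries tighten it by about 0.024 units, so these numbers look like kerning rather than word spaces.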
Meaning of character codes: To know the meaning of the first, second, third, ... last character codes, you'll have to look them up in the /ToUnicode table of your PDF. If no such table is embedded, then bad luck!
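A minimal Python sketch of such a lookup is given below. Note the assumptions: the CMap text in `SAMPLE_CMAP` is invented for illustration and is not taken from the OP's file, the function names are my own, only `beginbfchar` sections are read (`beginbfrange` sections are ignored), and character codes are assumed to be one byte long. A font with Identity-H encoding would need `code_length=2`.

```python
import re

# Hypothetical excerpt of a /ToUnicode CMap stream (after decompression).
SAMPLE_CMAP = """
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict:
    """Collect code -> Unicode string pairs from all bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes: bytes, mapping: dict, code_length: int = 1) -> str:
    """Translate character codes to text; unmapped codes become U+FFFD."""
    chars = []
    for i in range(0, len(codes), code_length):
        code = int.from_bytes(codes[i:i + code_length], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_codes(b"\x01\x02\x03\x03\x04", mapping))
```

With this made-up table, the codes 01 02 03 03 04 come out as "Hello", which also shows why the same code (03) can appear several times in a TJ array.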
Check easy extractability of text: To check whether your PDF lends itself easily to text extraction, you could use the command line tool pdffonts. Here is an example output:
$ pdffonts sample.pdf
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
In the above example, the subsetted font SSKFGJ+ArialMT uses a custom encoding, but the PDF has no /ToUnicode table for this font, as indicated by the column headed uni. Hence it is not easy to extract text that is shown with this font (extraction would require manual reverse engineering, but then you could also just "read" the PDF pages).
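If you want to automate this check, a small Python sketch like the following could scan captured pdffonts output for fonts without a /ToUnicode table. The function name `fonts_without_tounicode` is hypothetical. The sketch assumes the column layout shown above (a header line, a dashed line, then one font per line), that font names contain no spaces, and that the text passed in is only the tool's output, without the shell prompt line.

```python
def fonts_without_tounicode(pdffonts_output: str) -> list:
    """Return the names of fonts whose 'uni' column says 'no'."""
    names = []
    lines = pdffonts_output.strip().splitlines()
    for line in lines[2:]:            # skip the header and the dashed line
        fields = line.split()
        if len(fields) < 7:
            continue                  # not a font row
        # Counting from the right: object ID (2 fields), then uni, sub, emb.
        if fields[-3] == "no":
            names.append(fields[0])
    return names


SAMPLE_OUTPUT = """\
name                      type          encoding     emb sub uni object ID
------------------------- ------------- ------------ --- --- --- ---------
IADKRB+Arial-BoldMT       CID TrueType  Identity-H   yes yes yes     10  0
SSKFGJ+ArialMT            CID TrueType  Custom       yes yes no      11  0
"""

print(fonts_without_tounicode(SAMPLE_OUTPUT))
```

Counting fields from the right avoids having to deal with the spaces inside the type column ("CID TrueType").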
Abhishek,
This is far from an easy question and unfortunately it shows you have not read the PDF specification. You should do so.
You can download the Acrobat SDK here: http://www.adobe.com/devnet/acrobat/sdk/eula.html
Part of that is the PDF Specification which is a very hefty document explaining the ins and outs of PDF (including the answer to your question).
In short, and not as a substitute for reading the documentation: what you're looking at are character codes in the encoding of the font selected by the /F2 12 Tf operator, which sets the font used for any text shown after it.