I'm trying to extract some textual data from a PDF file. To do this, I need a sense of where some text is printed on the page, so I can correlate locations of different pieces of data. However, I'm getting stuck because I don't fully understand the behavior of the text matrix set by the Tm operator.
Tm (0.0, -5.28, 5.28, 0.0, 429.7006, 803.9603)
rg (0.617, 0.098, 0.043)
Tj '\x01'
Tm (0.0, -9.0, 9.0, 0.0, 428.1406, 784.8203)
rg (0.0, 0.219, 0.512)
Tc (2.4756,)
Tj '4567'
This is some of the stream content. As you can see, it has two Tm calls close together. All the normal text is printed in the Tm (0.0, -9.0, 9.0, 0.0) space; it looks like the -5.28/5.28 space is only used to print some special characters. Now, I know that the last two parameters to Tm set the current location, but these numbers seem to depend on more context (probably the 5.28 and 9.0 scales, somehow). I can't figure out how all this fits together, though, and the spec (page 250 has the Tm "explanation") seems spectacularly unhelpful to me.
EDIT: extended example, why this has me flummoxed:
Tm 0 -27 27 0 545.5606 817.2203
(rg, Tc, Tw, Tj, Tf omitted)
TD 0.0156 -1.2556
Tm 0 -9 9 0 441.9406 677.4803
TD 10.6733 0 # more omitted, including other TD ops with second param 0
TD -82.7267 -1.5333 # start of a new line
Tc 0
Tj (3)
Tf /F2 1
Tm 0 -5.28 5.28 0 429.7006 803.9603
Tj ()
Tf /TT2 1
Tm 0 -9 9 0 428.1406 784.8203
Tc 2.4756
Tj (4567) # these appear on the same line as before the double Tm
In my initial code I assumed that the e and f parameters to Tm and the parameters to TD were in the same space, which would give consistent coordinates. However, that fails here: the 4567 in the last Tj shows up on the same line as the earlier 3. By my calculation the y coordinate at the 3 is 677.4803 + (-1.5333) = 675.947, but the final Tm sets it to 784.8203, which suggests that "4567" should be drawn above the 3.
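The text matrix is combined with the current transformation matrix to determine the text position. The e and f operands of Tm are a translation in user space, so your text is placed at (429.7006, 803.9603) and at (428.1406, 784.8203), before the current transformation matrix is applied. The text size is 5.28 and 9 points. It is a common technique to set the font size to 1 using the Tf operator and set the actual font size by scaling the text matrix. Your text is also rotated by 90 degrees: a and d are 0, while b and c carry the scale. The operands of TD, by contrast, are expressed in text space, so they are scaled and rotated by the text matrix before they change the position. That is why they cannot simply be added to the e and f values of Tm.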
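To make the scale and rotation visible, here is a minimal Python sketch that splits the six Tm operands into their parts. The helper name `describe_text_matrix` is made up for illustration and is not part of any PDF library; it assumes the matrix has no skew, which holds for the two matrices in the question.

```python
import math


def describe_text_matrix(a, b, c, d, e, f):
    """Split Tm operands into scale, rotation and origin (hypothetical helper).

    Assumes a matrix without skew, as in the question.
    """
    x_scale = math.hypot(a, b)  # length of the transformed text-space x axis
    y_scale = math.hypot(c, d)  # length of the transformed text-space y axis
    # Direction of the baseline in degrees, counterclockwise positive.
    angle = math.degrees(math.atan2(b, a))
    return x_scale, y_scale, angle, (e, f)


small = describe_text_matrix(0.0, -5.28, 5.28, 0.0, 429.7006, 803.9603)
normal = describe_text_matrix(0.0, -9.0, 9.0, 0.0, 428.1406, 784.8203)
print(small)
print(normal)
```

Both matrices come out with an angle of -90 degrees, that is, the baseline points down the user-space y axis, and with scales of 5.28 and 9 respectively.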
A correct calculation of the text position requires parsing the entire content stream and executing all q, Q, cm, Tf, Tm and all the other text-related operators.
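As a rough illustration of what "executing" the operators means for Tm and TD, the sketch below tracks only the text line matrix. It is a simplification under stated assumptions: the class `TextState` and the function `multiply` are hypothetical names, the current transformation matrix is taken to be the identity, and glyph advances from Tj, the leading set by TD, and q/Q/cm are all ignored.

```python
def multiply(m, n):
    """Multiply two PDF matrices (a, b, c, d, e, f), row-vector convention."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


class TextState:
    """Tracks the text line matrix only (hypothetical, heavily simplified)."""

    def __init__(self):
        self.line_matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)

    def Tm(self, a, b, c, d, e, f):
        # Tm replaces the matrix; it does not combine with the previous one.
        self.line_matrix = (a, b, c, d, e, f)

    def TD(self, tx, ty):
        # The offset is in text space, so it is pre-multiplied onto the matrix.
        self.line_matrix = multiply((1.0, 0.0, 0.0, 1.0, tx, ty), self.line_matrix)

    def origin(self):
        return self.line_matrix[4], self.line_matrix[5]


state = TextState()
state.Tm(0, -9, 9, 0, 441.9406, 677.4803)
state.TD(10.6733, 0)
state.TD(-82.7267, -1.5333)
x, y = state.origin()
print(round(x, 4))
```

With the operators from the extended example, x ends up at about 428.1409, which is within rounding of the 428.1406 that the final Tm sets. Because the text is rotated, a line of text is a constant x here, so this agrees with "4567" appearing on the same line as the "3". The y value from this snippet is not meaningful, since the excerpt omits several TD operators.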