Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract text from PDF document based on position c++

Tags:

c++

pdf

podofo

I am trying to extract a text from a PDF document based on it's coordinates, so I have came across two notions in the Adobe PDF Reference (chap. 5.3):

  1. Text positioning operators
  2. Text showing operators

For now I am interested in Td & Tm positioning operators, while using Td you have tx and ty, relative to start of the current line which is clearly specified in a PDF document: tx ty Td, I have used this method to extract text by the tx and ty coordinates. The problem is that I don't know how to extract text from a PDF based on its position, while supplying only tx and ty.

a b c d e f Tm

this is the 'formula for' Tm usage. What does the a-f values represent ? This would be my input for Tm:

BT
/F1 8.88 Tf
0 0 0 rg
0.9998 0 0 1 401.52 448.08 Tm
[<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ
ET

Why does each group of four have a leading 00 ? is this in hex? should I convert it from hex to int and corresponding character?

this would be my input for Td:

BT 43.20 421.90 Td 0 Tw /C001 10.00 Tf 0.00 Tw <BlablablaTextInHexThatICanProcess>Tj ET

This is much clearer, the coordinates are clearer. How could extract the text from a Tm positioned PDF text object based on simple X and Y coordinates? I am using c++ and PoDoFo library

like image 307
AlexandruC Avatar asked May 09 '13 13:05

AlexandruC


2 Answers

First of all, when trying to extract text from a PDF based on its position, while supplying only tx and ty, it does not suffice to only consider the text matrix (which you set using the Tm operator you already found). You also have to consider the current transformation matrix!

I assume when you refer to a position as given in default user space coordinates.

To avoid the device-dependent effects of specifying objects in device space, PDF defines a device-independent coordinate system that always bears the same relationship to the current page, regardless of the output device on which printing or displaying occurs. This device-independent coordinate system is called user space.

The user space coordinate system shall be initialized to a default state for each page of a document. The CropBox entry in the page dictionary shall specify the rectangle of user space corresponding to the visible area of the intended output medium (display window or printed page). The positive x axis extends horizontally to the right and the positive y axis vertically upward

(section 8.3.2.3, ISO 32000-1:2008)

As we only see the x and y coordinates, we see the position as a vector (x, y) in R². Internally, though, PDFs consider this plane embedded in R³ with a constant z coordinate value 1, i.e. [x, y, 1]. This is because PDF wants to allow numerous kinds of transformations (translations, rotations, scaling, skewing, ...) but on the other hand wants to limit the required mathematical operations as far as possible. Incidentally after embedding our plane as [x, y, 1] into R³ all these transformations are possible by means of matrix multiplications:

Single transformation

Here you already see those numbers a, b, c, d, e, and f you asked about.

Now, before taking the text specific transformations into account, you have to take into account the manipulations of the current (text independent) transformation matrix. This matrix is manipulated by the cm operators:

a b c d e f cm Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(section 8.4.4, ISO 32000-1:2008)

This implies, BTW, that you have to consider all cm operators currently in action, i.e. all presented since the start of the page content, with the exception of those revoked by restoring a former graphics state (cf. the operators q and Q pushing and restoring graphic states, section 8.4.2, ISO 32000-1:2008).

Only now you can consider the text specific transformation matrices:

At the beginning of a text object, Tm shall be the identity matrix; therefore, the origin of text space shall be initially the same as that of user space. The text-positioning operators, described in Table 108, alter Tm and thereby control the placement of glyphs that are subsequently painted. Also, the text-showing operators, described in Table 109, update Tm (by altering its e and f translation components) to take into account the horizontal or vertical displacement of each glyph painted as well as any character or word-spacing parameters in the text state.

Additionally, within a text object, a conforming reader shall keep track of a text line matrix, Tlm, which captures the value of Tm at the beginning of a line of text. The text-positioning and text-showing operators shall read and set Tlm on specific occasions mentioned in Tables 108 and 109

(section 9.4.2, ISO 32000-1:2008)

Thus, inside of a text object you have to keep track of the text matrix which primarily is set using the Tm operator you found with the operands arranged in the matrix as shown above but which also is manipulated as an effect of other text positioning and text showing operators.

And there still are additional parameters determining the final position of the text, the text state parameters Tfs (the text font size), Th (the horizontal scaling), and Trise (the text rise), cf. section 9.3.1, ISO 32000-1:2008.

Conceptually, the entire transformation from text space to device space [or in your case to the default user space] may be represented by a text rendering matrix, Trm:

Text rendering matrix

Trm is a temporary matrix; conceptually, it is recomputed before each glyph is painted during a text-showing operation.

(section 9.4.2, ISO 32000-1:2008)

Thus, your coordinates (x, y) conceptually result from the text space coordinates by multiplication with Trm:

[x, y, 1] = [xts, yts, 1] x Trm

where (xts, yts) are (0, 0) at the glyphs origin. For every glyph printed you have a glyph displacement to get to the point where the next glyph origin will be positioned:

glyph displacement

The text matrix shall be updated by these glyph displacement values as follows:

Text matrix update by glyph displacement

(section 9.4.4, ISO 32000-1:2008)

I quoted a number of paragraphs from the current PDF specification ISO 32000-1:2008. I gather this is preferable to using the PDF Reference 1.4 which es quite ancient; furthermore it has been called "not normative in nature" by Adobe personal.

EDIT Some clarifications in answer to comments

device space and user space, what is the distinction between them, isn't the device space reffering to printer/ video display? and user space to a way of overcoming every device's particularities? like the user page being the document page that I see?

Yes, the device space is a fixed coordinate system essentially determined by the properties of the device at hand. And yes, the user space is a coordinate system independant from the target device. But no, it is not "the document page you see" because you see it on some device (or after being processed by some device).

The user space coordinate system is an independent coordinate system the coordinates of of a point of which can be translated to the device coordinates by means of a matrix multiplication with the current transformation matrix (CTM).

UserCoords x CTM = DeviceCoords

The user space coordinate system is initialized to a state where the CropBox entry in the page dictionary specifies the rectangle of user space corresponding to the visible area (see above) by initializing the CTM accordingly.

But as the choice of words already indicates ("current transformation matrix", "the coordinate system is initialized"), the user space coordinate system is a dynamic, everchanging coordinate system.

The default user space provides a consistent, dependable starting place for PDF page descriptions regardless of the output device used. If necessary, a PDF content stream may modify user space to be more suitable to its needs by applying the coordinate transformation operator, cm (see 8.4.4, "Graphics State Operators"). Thus, what may appear to be absolute coordinates in a content stream are not absolute with respect to the current page because they are expressed in a coordinate system that may slide around and shrink or expand. Coordinate system transformation not only enhances device-independence but is a useful tool in its own right.

(section 8.3.2.3, ISO 32000-1:2008)

Thus, when a PdfReader stumbles upon a cm operator with its parameters representing some matrix M, the CTM changes:

CTMnew = M x CTMold

and coordinates present in following operators are interpreted according to this new matrix CTMnew:

UserCoords x CTMnew = DeviceCoords

So now the user space coordinate system might be very different from its former state, scaled, rotated, skewed, whatever.

The coordinates you are essentially interested in most likely are those in the coordinate system the user space is initialized as, i.e. the device coordinate system for a virtual device for which the CTM is initialized as identity matrix.

where does text space and glyph space start and end.

The coordinates of text are specified in text space. The transformation from text space to user space are defined by a text matrix in combination with several text-related parameters in the graphics state (see 9.4.2, "Text-Positioning Operators").

The text matrix TM is initialized as the identity matrix at the start of a text object but changes during the execution of text operations, most visibly when you use the Tm operator, implicitly when you use others. This matrix is manipulated by a matrix TR containing the text-related parameters font size, horizontal scaling, and text rise. For details see the text rendering matrix TRM above. Thus,

DeviceCoords = UserCoords x CTM = TextCoords x TR x TM x CTM

The transformation from glyph space to text space shall be defined by the font matrix. For most types of fonts, this matrix shall be predefined to map 1000 units of glyph space to 1 unit of text space; for Type 3 fonts, the font matrix shall be given explicitly in the font dictionary (see 9.6.5, "Type 3 Fonts").

Thus, this transformation depends on the current font. The font matrix FM from the font dictionary would act like this:

DeviceCoords = GlyphCoords x FM x TR x TM x CTM

You do not want to locate the device coordinates of a single segment of a glyph, so these coordinates do not seem to interest. The glyph widths, though, are to be interpreted in glyph space. Unless you are dealing with Type 3 fonts, though, this merely means that you have to divide them by 1000...

And how does parameters w0 and w1 evolve during glyph painting? are they initially (0,0)

w0 and w1 denote the glyph's horizontal and vertical displacements. In horizontal writing mode, w0 is the glyph widths transformed to text mode (i.e. most often merely divided by 1000) and w1 is 0. For vertical writing mode text inspect sections 9.2.4 and 9.7.4.3 in ISO 32000-1:2008.

does text space have the same origin as the first glyph space? and they get updated with the calculated (tx,ty)?

As the glyph space coordinates are merely multiplied by the font matrix to result in text space coordinates and the font matrix in all cases but for Type 3 fonts merely compresses by a factor of 1000, see above, the glyph origin is mapped to the text space origin.

But tx and ty are used to update the text matrix itself. Thus, the text spece coordinate system moves for each glyph and for each (non-Type 3) glyph origin maps to origin... of a slightly changed text space coordinate system.

like image 72
mkl Avatar answered Oct 20 '22 15:10

mkl


Do not underestimate the scale of this task. The text matrix bit is pretty simple and straightforward. The difficult bit is the text itself.

Let's start with your query - why does each group of four have a leading 00?

Well PDF doesn't have a standard text encoding - it has lots and lots and lots. You need to know what the encoding is for the font before you can decode the text.

So in your example:

BT
/F1 8.88 Tf
0 0 0 rg
0.9998 0 0 1 401.52 448.08 Tm
[<0014>-11<0015>-11<0013>-11<000F>-19<0014>-11<0019>] TJ
ET

The font is the /F1 bit. This is a name that exists in the Page (or parents of) that relates to a font. You need to look up the font and find out what the encoding is.

Given the content in your example I suspect that the encoding is an identity one and that the four digit hex numbers are glyph IDs within the font. If this is the case then the font should have a ToUnicode entry which will allow you to look up the glyph ID and get back a Unicode character.

Other fonts may or may not have ToUnicode entries and if this occurs there are a variety of ways you can extract the Unicode text. Different methods may give different results which is why the PDF spec has an entire section entitled "Extraction of Text Content" detailing the order in which these should be attempted.

Hopefully your PoDoFo library should have methods to do this kind of conversion. If not the task will be quite hard and I think you should consider some other options. I wrote the text extraction code for our ABCpdf .NET library and it took some months to code followed by some years of tweaking.

like image 4
OnceUponATimeInTheWest Avatar answered Oct 20 '22 16:10

OnceUponATimeInTheWest