
How to iterate over all the objects in a PDF page and check which ones are text objects?

Tags: python, pypdf

I want to iterate over all the objects in a page of a pdf using pypdf.

I also want to check the type of each object, i.e. whether it is text or graphics.

A code snippet would be a great help.

Thanks a lot

Shan asked Oct 20 '12 08:10



1 Answer

I think that PyPDF is not the correct tool for the job. What you need is to parse the page itself (for which PyPDF has limited support, see the API documentation), and then to be able to save the results in another PDF object after changing some of the objects.

You might decompress the PDF using pdftk, and this would allow you to use pdfrw.

However, from what you write,

My ultimate goal is to color each text object differently.

a "text object" may be a quite complex object made up of (e.g.) different lines in different paragraphs. It might be, and you might see it as, a single entity. Within this entity there might already be several different text-color commands.

For example you might have a single stream with this sequence of text (this is written in "internal" language):

12.84 0 Td(S)Tj
0.08736 Tc
9 0 Td(e)Tj
0.06816 Tc
0.5 g
7.55999 0 Td(qu)Tj
0.08736 Tc
1 g
16.5599 0 Td(e)Tj
0.06816 Tc
7.55999 0 Td(n)Tj
0.08736 Tc
8.27996 0 Td(c)Tj
-0.03264 Tc
0.13632 Tw
7.55999 0 Td(e )Tj
0.06816 Tc
0 Tw

This might write "Sequence". It is actually made up of seven text subobjects, and I know of no library that can "decrypt" the stream into its component subobjects, much less assign them the proper attributes (which in PDF descend from the graphics state, whereas in a hierarchical structure such as XML they would probably be attached to the individual node, perhaps through inheritance).
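To make the "seven subobjects" point concrete, here is a minimal sketch (not a full content-stream parser) that pulls the string operand of each Tj out of the sample above. The names TJ_PATTERN and text_pieces are made up for this illustration, and the regular expression is a simplification: it ignores nested parentheses, hex strings and the TJ, ' and " operators, and it assumes the font encoding leaves the strings readable.

```python
import re

# A literal string operand followed by the Tj operator, e.g. "(qu)Tj".
# Simplification: backslash escapes are tolerated, nested parentheses are not.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def text_pieces(stream: str) -> list[str]:
    """Return the raw string operand of every Tj operator, in stream order."""
    return TJ_PATTERN.findall(stream)


# The sample stream quoted in the answer.
SAMPLE = """\
12.84 0 Td(S)Tj
0.08736 Tc
9 0 Td(e)Tj
0.06816 Tc
0.5 g
7.55999 0 Td(qu)Tj
0.08736 Tc
1 g
16.5599 0 Td(e)Tj
0.06816 Tc
7.55999 0 Td(n)Tj
0.08736 Tc
8.27996 0 Td(c)Tj
-0.03264 Tc
0.13632 Tw
7.55999 0 Td(e )Tj
0.06816 Tc
0 Tw
"""

if __name__ == "__main__":
    pieces = text_pieces(SAMPLE)
    # Seven separate Tj operands that together spell the word.
    print(len(pieces), repr("".join(pieces)))
```

Each piece is positioned by its own Td, so nothing in the stream says that these seven operands form one word.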

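Moreover, the stream might include non-text commands (e.g. lines and filled shapes). Text is normally painted with the current fill (nonstroking) color, which is what g and rg set, so changing the "text" color would then also change the color of any non-textual objects that follow.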

A library would have to give you access at a level of detail similar to that of reading the text stream directly, so doing this through a library seems unlikely.

Since this is word processing work, you might look into the possibility of converting PDF to OpenOffice (using the PDF Import extension), manipulating it through OOo python, then exporting it back to PDF from within OpenOffice itself.

Beware, however, for there be dragons: the documentation is sketchy, and the interface is sometimes unstable. Accessing "text" might not be practical (the more so, since text will be available to you only on a line by line basis).

Another possibility (again, not for the faint of heart) is to decode the PDF yourself. Start by getting it in uncompressed format through pdftk. This will yield a header followed by a stream of objects in the form

INDEX GENERATION obj
<<
COMMANDS OR DATA
>>
[ stream 
STREAM OF TEXT
endstream ]
endobj

You can read the stream, and for each object:

  1. If COMMANDS OR DATA is only /Length length, it is likely a text stream. Otherwise go to step 3.
  2. Parse the stream (see below). If its length changes, remember to update /Length accordingly.
  3. Note the current output file offset, save it in XREF[i] ("reference offset for the i-th object"), and write the object to the output file.
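The first step of that loop can be sketched roughly as follows. This is only an illustration under strong assumptions: the file has already been uncompressed, /Length is written as a direct number, and the words "stream" and "endobj" never occur inside strings or dictionaries. The names OBJ_PATTERN, ONLY_LENGTH, iter_objects and looks_like_text_stream are invented for the example.

```python
import re

# "N G obj <header> [stream ... endstream] endobj", matched non-greedily.
OBJ_PATTERN = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*"
    rb"(?:stream\r?\n(.*?)\r?\nendstream\s*)?endobj",
    re.DOTALL,
)

# A dictionary that contains nothing but a direct /Length entry.
ONLY_LENGTH = re.compile(rb"<<\s*/Length\s+\d+\s*>>")


def iter_objects(pdf: bytes):
    """Yield (object number, header bytes, stream bytes or None)."""
    for m in OBJ_PATTERN.finditer(pdf):
        yield int(m.group(1)), m.group(3), m.group(4)


def looks_like_text_stream(header: bytes, stream) -> bool:
    """Heuristic from step 1: a stream whose dictionary is only /Length."""
    return stream is not None and ONLY_LENGTH.fullmatch(header) is not None


# Hand-made fragment standing in for an uncompressed file body.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"4 0 obj\n<< /Length 17 >>\nstream\nBT (Hi) Tj ET 1 g\nendstream\nendobj\n"
)

if __name__ == "__main__":
    for number, header, stream in iter_objects(SAMPLE):
        print(number, looks_like_text_stream(header, stream))
```

A real file can break these assumptions easily (binary streams, indirect /Length values), so treat the regular expression as a starting point for experiments rather than a parser.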

After the last object you will find the xref table, in which each object is listed with the file offset at which it resides. These offsets (10-digit numbers) will have to be rewritten according to the new offsets you saved in XREF. The offset at which the xref table itself starts goes after the startxref keyword at the end of the PDF file.
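The offset bookkeeping might look like the sketch below. It assumes, for simplicity, that objects are numbered 1..N without gaps, all have generation 0, and that object 1 is the document catalog unless told otherwise; build_pdf is a hypothetical helper, not part of any library.

```python
def build_pdf(bodies: list[bytes], root: int = 1) -> bytes:
    """Write bodies as objects 1..N and append a freshly computed xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # offset of "N 0 obj" in the output
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)  # this value goes after "startxref"
    size = len(bodies) + 1  # entry 0 is the head of the free list
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation.
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    pdf = build_pdf([
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ])
    print(pdf.decode("latin-1"))
```

Because the offsets are measured while writing, any stream whose length changed upstream is accounted for automatically.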

(To debug, start by writing a routine that copies all objects without modification. It must recalculate the xref table and offsets, and still yield a PDF file identical to the original.)

The PDF thus obtained can be recompressed by pdftk to save space.
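If you drive pdftk from Python, the two conversions could be wrapped roughly like this. The uncompress and compress output options are pdftk's own; the helper names and the file names in the usage note are placeholders for this sketch.

```python
import shutil
import subprocess


def pdftk_command(src: str, dst: str, mode: str) -> list[str]:
    """Build the argument list for 'pdftk SRC output DST compress|uncompress'."""
    if mode not in ("compress", "uncompress"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return ["pdftk", src, "output", dst, mode]


def run_pdftk(src: str, dst: str, mode: str) -> None:
    """Run pdftk, failing early if the executable is not on PATH."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    subprocess.run(pdftk_command(src, dst, mode), check=True)
```

For example, run_pdftk("input.pdf", "plain.pdf", "uncompress") before editing, and run_pdftk("edited.pdf", "final.pdf", "compress") afterwards.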

As for parsing the PDF text stream, you basically go through it line by line looking for text output commands (see PDF Reference 5.3.2). The command you will see most often is Tj:

9.95999 0 Td(Hello, world)Tj

and color-changing commands (see PDF Reference 4.5.1; again, the most used are g and rg):

1 g             % Sets fill color to white (1 in colorspace Gray; 0 is black)
1 0 0 rg        % Sets fill color to red (1,0,0 in colorspace RGB)
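As a rough sketch of that bookkeeping, the function below walks a stream line by line and records which fill color is in effect for each Tj. It only understands g and rg written alone on a line, as in the samples here, and it reports gray levels as equal RGB components for convenience; colored_text and the pattern names are invented for the example.

```python
import re

NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"
GRAY = re.compile(rf"^\s*({NUM})\s+g\s*$")
RGB = re.compile(rf"^\s*({NUM})\s+({NUM})\s+({NUM})\s+rg\s*$")
TJ = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def colored_text(stream: str) -> list[tuple[str, tuple[float, float, float]]]:
    """Pair every Tj string with the fill color active when it is shown."""
    color = (0.0, 0.0, 0.0)  # the initial fill color of a page is black
    result = []
    for line in stream.splitlines():
        gray = GRAY.match(line)
        rgb = RGB.match(line)
        if gray:
            level = float(gray.group(1))
            color = (level, level, level)
        elif rgb:
            color = tuple(float(x) for x in rgb.groups())
        else:
            for text in TJ.findall(line):
                result.append((text, color))
    return result


if __name__ == "__main__":
    demo = "1 0 0 rg\n5 0 Td(A)Tj\n0.5 g\n5 0 Td(B)Tj\n"
    for text, color in colored_text(demo):
        print(text, color)
```

Other color operators (k, sc, scn and the stroking variants) and the q/Q state stack are deliberately left out and would need the same treatment in real files.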

You then keep track of the current color, and might, for example, enclose each Tj command between a pair of color commands (such as rg) of your choosing: one that sets your text color, and one that restores the original. This way you can be sure that the graphics state does not "spill" onto nearby objects, lines, etc. It will increase the object's Length and make the resulting PDF a little slower to render (but not by much; you might not even notice).
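A minimal sketch of that wrapping step, under the same assumptions as before (one operator group per line, fill color set only through g or rg), could look like this. recolor_text and the palette argument are hypothetical names chosen for the example.

```python
import re
from itertools import cycle

NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# A line that only sets the fill color, via "v g" or "r g b rg".
FILL_OP = re.compile(rf"^\s*(?:{NUM}\s+g|{NUM}\s+{NUM}\s+{NUM}\s+rg)\s*$")
# A line that ends by showing a literal string with Tj.
SHOWS_TEXT = re.compile(r"\)\s*Tj\s*$")


def recolor_text(stream: str, palette) -> str:
    """Wrap each Tj line in 'set my color' / 'restore original color'."""
    colors = cycle(palette)
    current = "0 g"  # initial fill color: black
    out = []
    for line in stream.splitlines():
        if FILL_OP.match(line):
            current = line.strip()  # remember the document's own color
            out.append(line)
        elif SHOWS_TEXT.search(line):
            r, g, b = next(colors)
            out.append(f"{r:g} {g:g} {b:g} rg")  # our color for this piece
            out.append(line)
            out.append(current)  # put the original color back
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = "0.5 g\n5 0 Td(A)Tj\n0 0 m 10 10 l S\n"
    changed = recolor_text(original, [(1, 0, 0), (0, 0, 1)])
    print(changed)
    # The new byte count is what /Length must be updated to.
    print(len(changed.encode("latin-1")))
```

Restoring the color right after each Tj is what keeps the later line-drawing command in the demo from picking up the text color.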

LSerni answered Oct 12 '22 08:10
