
Parsing PDF file using Regular expressions in Python

I am trying to parse some object elements from a PDF file using Python's re module. My goal is to extract each PDF object with a regular expression. An example of the PDF objects is the following:

1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
...

When I use "\d+\s\d+\sobj[\s,\S]*endobj" it doesn't work: the match keeps going until the last endobj is found. How can I modify the regular expression so that it matches each object separately (in other words, the part from 1 0 obj until the first endobj that follows it)?

Iketani Kouichiro asked Dec 09 '22 13:12


2 Answers

If you are using only regex, it is easy to construct a PDF file that your program will not be able to handle. PDF dictionaries and lists can contain other objects, and regular expressions cannot handle recursive structures, at least not with Python's re module.
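As a small illustration of that point, here is a sketch in which a single, perfectly valid object defeats a non-greedy pattern, simply because the string inside it contains the word endobj. The pattern and the sample data are assumptions made up for this example.

```python
import re

# Hypothetical sample: one indirect object whose string value contains "endobj".
data = """1 0 obj
<< /Title (this string mentions endobj on purpose) >>
endobj
"""

# Non-greedy pattern: the match stops at the first "endobj" it encounters.
pattern = re.compile(r"\d+\s+\d+\s+obj[\s\S]*?endobj")

match = pattern.search(data)
truncated = match.group(0)

# The match ends inside the string literal, so the dictionary is cut short
# and its closing ">>" is never reached.
print(truncated)
```

A real parser avoids this by tracking which kind of token it is currently inside, which a single pattern cannot do.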

A PDF file is a tree of objects and streams:

  • Dictionaries: << (name value)* >>
  • Lists: [ (value)* ]
  • Names: / (regular char)*
  • Strings: ( (char)* )
  • Hex strings: < (hexchar)* >
  • Numbers: (-)? ((digit)+ | (digit)+ . (digit)* | . (digit)+)
  • Booleans: true | false
  • References: (digit)+ (whitespace)+ (digit)+ (whitespace)+ R

Whitespace and comments are ignored in most places. Comments start with % and run until the end of the line.
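The grammar above could be turned into a small tokenizer plus a recursive parser, roughly as sketched below. This is only a simplified sketch: it assumes strings have no escapes or nested parentheses, it works on text rather than bytes, and the names TOKEN_RE, tokenize and parse_value are made up for the example.

```python
import re

# One alternative per token type from the list above (simplified subset).
TOKEN_RE = re.compile(r"""
    (?P<comment>%[^\r\n]*)
  | (?P<dict_open><<)
  | (?P<dict_close>>>)
  | (?P<list_open>\[)
  | (?P<list_close>\])
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<string>\([^()]*\))
  | (?P<hexstring><[0-9A-Fa-f\s]*>)
  | (?P<number>[+-]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<keyword>[A-Za-z]+)
  | (?P<space>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace and comments."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup not in ("space", "comment"):
            yield m.lastgroup, m.group()


def parse_value(tokens):
    """Parse one value, consuming tokens from the front of the list."""
    kind, value = tokens.pop(0)
    if kind == "dict_open":
        result = {}
        while tokens[0][0] != "dict_close":
            key = parse_value(tokens)
            result[key] = parse_value(tokens)  # recursion handles nesting
        tokens.pop(0)  # drop ">>"
        return result
    if kind == "list_open":
        items = []
        while tokens[0][0] != "list_close":
            items.append(parse_value(tokens))
        tokens.pop(0)  # drop "]"
        return items
    if kind == "number":
        # "3 0 R" is a reference: two integers followed by the keyword R.
        if (value.isdigit() and len(tokens) >= 2
                and tokens[0][0] == "number" and tokens[0][1].isdigit()
                and tokens[1] == ("keyword", "R")):
            generation = int(tokens.pop(0)[1])
            tokens.pop(0)  # drop "R"
            return ("ref", int(value), generation)
        return float(value) if "." in value else int(value)
    return value  # names, strings, hex strings and keywords stay as raw text


sample = "<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >> % a comment"
parsed = parse_value(list(tokenize(sample)))
print(parsed)
```

The regular expression is only used for the flat tokens here. The nesting of dictionaries and lists is handled by the recursive calls in parse_value, which is the part a single pattern cannot express.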

Indirect objects are specified as:

1 0 obj
(any object)
endobj

This object can then be referenced as 1 0 R. Indirect dictionaries can also have a stream attached:

1 0 obj
<<
/Length 22
>>
stream
(22 bytes of raw data)
endstream
endobj
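Because the stream data is raw bytes, it may itself contain words such as endstream or endobj, so a reader would normally trust /Length instead of searching for the end marker. A minimal sketch of that idea follows. It assumes /Length is a direct integer (in real files it may also be an indirect reference such as 5 0 R, which this sketch does not handle), and read_stream and the sample bytes are invented for the example.

```python
import re

# Hypothetical sample: a 22-byte payload that happens to contain "endstream".
raw = (b"1 0 obj\n<< /Length 22 >>\nstream\n"
       b"abc endstream xyz 1234\n"
       b"endstream\nendobj\n")


def read_stream(data):
    """Return the stream payload, trusting a direct integer /Length."""
    length_match = re.search(rb"/Length\s+(\d+)", data)
    start_match = re.search(rb"stream\r?\n", data)
    if length_match is None or start_match is None:
        raise ValueError("no stream with a direct /Length found")
    length = int(length_match.group(1))
    start = start_match.end()  # the payload begins right after the line break
    return data[start:start + length]


payload = read_stream(raw)
print(payload)
```

Searching for the first endstream instead would have cut this payload off after "abc ".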

A PDF file looks something like this:

%PDF-1.4
%ÿÿÿÿ
1 0 obj
<< /Author (MizardX) >>
endobj
2 0 obj
<<
/Type /Catalog
% more required keys
>>
endobj
%lots of more indirect objects, one after another
xref
0 3
0000000000 65535 f
0000000015 00000 n
0000000054 00000 n
trailer
<<
/Info 1 0 R
/Root 2 0 R
% ... more required keys
>>
startxref
225
%%EOF

The root of the object tree is the trailer dictionary. Every object is referenced, directly or indirectly, from this dictionary.
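Since everything hangs off the trailer, a reader usually starts at the end of the file rather than at the beginning. The sketch below shows that first step on a shortened, hypothetical file tail: it reads the number after the last startxref and the /Root reference from the trailer. The helper names find_startxref and find_root are made up, and the regex for /Root only works for a simple trailer like this one.

```python
import re

# Hypothetical, shortened end of a PDF file (the xref table itself is omitted).
tail = (b"trailer\n<<\n/Info 1 0 R\n/Root 2 0 R\n>>\n"
        b"startxref\n225\n%%EOF\n")


def find_startxref(data):
    """Return the byte offset written after the last 'startxref' keyword."""
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("startxref not found")
    m = re.match(rb"startxref\s+(\d+)", data[pos:])
    if m is None:
        raise ValueError("no offset after startxref")
    return int(m.group(1))


def find_root(data):
    """Return (object number, generation) of the /Root reference."""
    m = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data)
    if m is None:
        raise ValueError("/Root not found")
    return int(m.group(1)), int(m.group(2))


print(find_startxref(tail))
print(find_root(tail))
```

From there, the cross-reference table gives the byte offset of each indirect object, so objects can be located without scanning the file for obj and endobj.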

There is a lot more complexity hidden inside the streams, but that does not affect the file structure.

The full specification can be found at Adobe's website.

Markus Jarderot answered Dec 29 '22 05:12


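You need to use *? instead of *. It is the non-greedy version of the quantifier and matches as little as possible, so the match stops at the first endobj rather than the last one. See the documentation of Python's re module for details.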

Also, note that the PDF format is very complex, especially once it contains binary streams, but if you know the PDFs you are looking at are simple, then this should work.
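For the two objects in the question, a non-greedy version might look like the sketch below. The capture groups and the names OBJ_RE and objects are my own additions, and as noted above this only holds up for simple files.

```python
import re

data = """1 0 obj
<<
    /Type /Catalog
    /Pages 2 0 R
>>
endobj
2 0 obj
<<
    /Type /Pages
    /Kids [ 3 0 R ]
    /Count 1
>>
endobj
"""

# [\s\S]*? matches any characters, newlines included, but as few as possible,
# so each match ends at the nearest "endobj".
OBJ_RE = re.compile(r"(\d+)\s+(\d+)\s+obj\b([\s\S]*?)\bendobj")

# Map (object number, generation) to the text between "obj" and "endobj".
objects = {
    (int(m.group(1)), int(m.group(2))): m.group(3).strip()
    for m in OBJ_RE.finditer(data)
}

for key, body in objects.items():
    print(key, repr(body))
```

Using re.finditer returns one match per object instead of a single match spanning the whole file.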

neil answered Dec 29 '22 05:12