Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the names of AcroField revisions in iText?

I'm currently working on a tool for applying and verifying digital signatures on pdf documents. I understand that each signature in a pdf document is applied to a certain revision of the AcroFields. Each time the user changes some input (i.e. by filling out a pdf form), a new revision is created.

My question is: how can I retrieve all revisions from an AcroFields object? I know how to get a single revision if I have its name like this:

AcroFields fields = ...;
fields.getRevision("revision1");

But how can I retrieve all revisions (or their names, at least)? I haven't found anything in the iText API and the web so far.

I'm using iText in version 2.1.7.

Thanks and Regards, Tom

like image 851
Tom Avatar asked Jun 18 '13 12:06

Tom


People also ask

How are acrofields read and saved in the database?

The AcroFields are read from the document and saved to the database using a parameterized query, AcroFields are across the whole document and are not referenced per page. The following field data is saved in the database if the field has data: File name, Field name, Field value, Field type, File type.

Do revisions have a name in iText?

In contrast to your assumption, these revisions do not have a name per se. Unless you use the second part of the ID (which should be different for different revisions) that is, but AFAIK that is not used as a name for anything in iText. There is some imprecision in the exact point where a trailer stops and the next body update starts.

What is a revision of the acrofields?

These states are commonly named revisions of the document, and each revision of the document obviously reflects some state of the form information of the PDF which I assume is what you call revision of the AcroFields.

What is the difference between number and acrofields?

“ Number: (?<one>.+)Rev: ” – find characters between Number: and Rev: and assign to group “one” must be re-written as - “ Number: (?<one>>.+)Rev: ” The AcroFields are read from the document and saved to the database using a parameterized query, AcroFields are across the whole document and are not referenced per page.


1 Answers

Some backgrounds first

Normally a PDF file consists of a header, a body, cross reference information, and a trailer, see figure 2 below. When updating such a PDF file, you have the choice of

  • either building the whole document anew with all the changes integrated (which results in a PDF again formed like the original one)
  • or you can append the changes of body elements and cross references to the document and add a new trailer also referencing the former trailer (which results in a PDF formed like depicted in figure 3 below).

Actually there are some forms in-between, though. E.g. some tools merely cut off the cross references and the trailer of the original document and then add their new or changed body elements, new complete cross references, and a new trailer without any back reference to some former state.

Initial structure of a PDF fileStructure of an updated PDF file

(Images copied from the PDF specification ISO 32000-1:2008)

In case of PDFs formed like in figure 3, we have a history of different states of the PDF at hand, each starting at the start of the file and reaching up to and including one of the trailers. These states are commonly named revisions of the document, and each revision of the document obviously reflects some state of the form information of the PDF which I assume is what you call revision of the AcroFields.

In contrast to your assumption, these revisions do not have a name per se. Unless you use the second part of the ID (which should be different for different revisions) that is, but AFAIK that is not used as a name for anything in iText.

There is some imprecision in the exact point where a trailer stops and the next body update starts. On one hand there are some choices imminent in the specification (different possible line breaks, ignored white space, ignored comment lines) and on the other hand many PDF producers are a bit beyond the specification anyways. This in combination with the in-between varieties between full and incremental updates mentioned above can make the process of extracting the revisions somewhat troublesome sometimes.

There is a special case of revisions which can be recognized with a great degree of reliability: Signed revisions, i.e. revisions whose last body update contains an integrated signature for the document. As the signed byte ranges of a document must encompass all of the document revision but the gap left for the signature itself (at least to be accepted by Adobe software and to be conform to PAdES and PDF-2 standards), the exact end of the revision in that case can be deduced from the signature information:

Multiple integrated signatures

More details here.

Some answers to your questions

I understand that each signature in a pdf document is applied to a certain revision of the AcroFields.

As mentioned above, each is applied to a certain revision of the document which then implies a certain state or "revision" of the form data.

Each time the user changes some input (i.e. by filling out a pdf form), a new revision is created.

Not necessarily. As mentioned above there are many in-between methods for updates.

Only if changing the information of a document whose latest revision is signed, a proper incremental update is necessary if that signature is not to be removed or invalidated. Otherwise an updater can take all information added after the last signature, create an own update with any content he wishes, and append that update to the last signed revision of the document. This update may even contain multiple virtual update blocks with the intention of making you believe certain intermediate revisions actually existed.

Thus, only signed revisions can be somehow trusted to be true. iText only provides access to such signed revisions.

My question is: how can I retrieve all revisions from an AcroFields object?

You can extract all the signed document revisions using

InputStream revisionStream = fields.extractRevision("name");

and open them in separate PdfReader instances. Then you can access the PDF form information of each of these signed revisions by querying the AcroFields instance of the respective PdfReader opened for that revision.

(BTW, the String argument is not the name of the revision but the name of the signature field whose signature signs that revision.)

But how can I retrieve all revisions (or their names, at least)? I haven't found anything in the iText API and the web so far.

As mentioned before, those revision names actually are signature field names. Thus, you can use

List<String> names = fields.getSignatureNames()

to retrieve all names for which a revision can be extracted.

like image 168
mkl Avatar answered Dec 13 '22 20:12

mkl