I am working on improving the PDF scrubber in the ApprovalTests framework. Looking at a simple PDF generated with PdfSharp, I see that its contents are as follows.
Does anyone know what the ID field toward the bottom is?
%PDF-1.4
%ÓôÌá
1 0 obj
<<
/CreationDate(D:20131119194420-06'00')
/Creator(PDFsharp 1.32.3057-g \(www.pdfsharp.net\))
/Producer(PDFsharp 1.32.3057-g \(www.pdfsharp.net\))
>>
endobj
2 0 obj
<<
/Type/Catalog
/Pages 3 0 R
>>
endobj
3 0 obj
<<
/Type/Pages
/Count 1
/Kids[4 0 R]
>>
endobj
4 0 obj
<<
/Type/Page
/MediaBox[0 0 612 792]
/Parent 3 0 R
/Contents 5 0 R
/Resources
<<
/ProcSet [/PDF/Text/ImageB/ImageC/ImageI]
/ExtGState
<<
/GS0 6 0 R
>>
/Font
<<
/F0 8 0 R
>>
>>
/Group
<<
/CS/DeviceRGB
/S/Transparency
/I false
/K false
>>
>>
endobj
5 0 obj
<<
/Length 99
/Filter/FlateDecode
>>
stream
xœŠI
€@ïyE¼)¸ÄŒ^—«ðŽ
2"êÍ×)ènšº ER¢¿ÊŠq>t¡¼pA-t#áö@ÒªÄú¯À†ã¢R7#ç(ý~qîq:og½
endstream
endobj
6 0 obj
<<
/Type/ExtGState
/ca 1
>>
endobj
7 0 obj
<<
/Type/FontDescriptor
/Ascent 1005
/CapHeight 727
/Descent -210
/Flags 32
/FontBBox[-550 -303 1707 1072]
/ItalicAngle 0
/StemV 0
/XHeight 548
/FontName/Verdana,Bold
>>
endobj
8 0 obj
<<
/Type/Font
/Subtype/TrueType
/BaseFont/Verdana,Bold
/Encoding/WinAnsiEncoding
/FontDescriptor 7 0 R
/FirstChar 0
/LastChar 255
/Widths[1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 341 402 587 867 710 1271 862 332 543 543 710 867 361 479 361 689 710 710 710 710 710 710 710 710 710 710 402 402 867 867 867 616 963 776 761 723 830 683 650 811 837 545 555 770 637 947 846 850 732 850 782 710 681 812 763 1128 763 736 691 543 689 543 867 710 710 667 699 588 699 664 422 699 712 341 402 670 341 1058 712 686 699 699 497 593 455 712 649 979 668 650 596 710 543 710 867 1000 710 1000 332 710 587 1048 710 710 710 1777 710 543 1135 1000 691 1000 1000 332 332 587 587 710 710 1000 710 963 593 543 1067 1000 596 736 341 402 710 710 710 710 543 710 710 963 597 849 867 479 963 710 587 867 597 597 710 721 710 361 710 597 597 849 1181 1181 1181 616 776 776 776 776 776 776 1093 723 683 683 683 683 545 545 545 545 830 846 850 850 850 850 850 867 850 812 812 812 812 736 734 712 667 667 667 667 667 667 1018 588 664 664 664 664 341 341 341 341 679 712 686 686 686 686 686 867 686 712 712 712 712 650 699 650]
>>
endobj
xref
0 9
0000000000 65535 f
0000000015 00000 n
0000000180 00000 n
0000000228 00000 n
0000000283 00000 n
0000000538 00000 n
0000000707 00000 n
0000000750 00000 n
0000000935 00000 n
trailer
<<
/ID[<48189AA5E6D2394D8EF6E7842493B4A9><48189AA5E6D2394D8EF6E7842493B4A9>]
/Info 1 0 R
/Root 2 0 R
/Size 9
>>
startxref
2167
%%EOF
Some remarks to add to the picture from @Millie's answer:
When in doubt about some aspects of PDF, the first place to look should be the specification ISO 32000-1.
It specifies the ID entry as:
ID array (Required if an Encrypt entry is present; optional otherwise; PDF 1.1)
An array of two byte-strings constituting a file identifier (see 14.4, "File Identifiers") for the file. If there is an Encrypt entry this array and the two byte-strings shall be direct objects and shall be unencrypted.
NOTE 1 Because the ID entries are not encrypted it is possible to check the ID key to assure that the correct file is being accessed without decrypting the file. The restrictions that the string be a direct object and not be encrypted assure that this is possible.
NOTE 2 Although this entry is optional, its absence might prevent the file from functioning in some workflows that depend on files being uniquely identified.
NOTE 3 The values of the ID strings are used as input to the encryption algorithm. If these strings were indirect, or if the ID array were indirect, these strings would be encrypted when written. This would result in a circular condition for a reader: the ID strings must be decrypted in order to use them to decrypt strings, including the ID strings themselves. The preceding restriction prevents this circular condition.
(Table 15 – Entries in the file trailer dictionary)
NOTE 2 above in essence is a recommendation to add this optional value even though it is not formulated using the SHALL/SHOULD/MAY specification language conventions applied elsewhere in this document.
The recommendation is more explicit in the referenced section 14.4:
The ID entry is optional but should be used.
As "should" in these specifications denotes a recommendation, and a recommendation is defined as something one has to do unless there are good reasons not to, a PDF writer has to create this entry unless it can argue against the requirement (I can hardly think of any such arguments). This should answer the question asked in response to Millie's answer:
any idea why both PdfSharp and phantomjs create it?
In particular, creating the entry is not merely considered good practice, as assumed in another comment above.
Concerning the contents of the ID array, the specification continues in section 14.4:
The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the file at the time it was originally created and shall not change when the file is incrementally updated. The second byte string shall be a changing identifier based on the file’s contents at the time it was last updated. When a file is first written, both identifiers shall be set to the same value. If both identifiers match when a file reference is resolved, it is very likely that the correct and unchanged file has been found. If only the first identifier matches, a different version of the correct file has been found.
To help ensure the uniqueness of file identifiers, they should be computed by means of a message digest algorithm ...
The calculation of the file identifier need not be reproducible; all that matters is that the identifier is likely to be unique.
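The matching rule quoted above can be sketched in a few lines. This is only an illustration, not part of any PDF library: `read_file_id` and `compare_ids` are hypothetical helper names, and the regular expression assumes the two identifiers are written as hex strings (as in the PdfSharp output above), not as literal strings in parentheses.

```python
import re

# Assumption: the /ID array holds two hex strings, e.g. /ID[<AB..><CD..>].
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f\s]*)>\s*<([0-9A-Fa-f\s]*)>\s*\]"
)


def _decode_hex_string(raw):
    """Decode the body of a PDF hex string into bytes."""
    digits = b"".join(raw.split()).decode("ascii")  # whitespace is ignored
    if len(digits) % 2:
        digits += "0"  # an odd final digit is treated as if followed by 0
    return bytes.fromhex(digits)


def read_file_id(pdf_bytes):
    """Return (permanent, changing) identifiers, or None if there is no /ID."""
    matches = ID_PATTERN.findall(pdf_bytes)
    if not matches:
        return None
    # After incremental updates the last trailer is the current one.
    first, second = matches[-1]
    return _decode_hex_string(first), _decode_hex_string(second)


def compare_ids(expected, actual):
    """Classify a candidate file against the identifiers of a file reference."""
    if expected is None or actual is None:
        return "unknown"  # nothing to compare against
    if expected[0] != actual[0]:
        return "different file"
    if expected[1] == actual[1]:
        return "same file, unchanged"
    return "same file, different version"
```

For the trailer in the question, `read_file_id` returns two equal 16-byte values, which is what the specification requires for a file that has just been written for the first time.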
Thus, the first article Millie quoted from is not entirely correct when it claims:
the file identifier (the /ID entry from the trailer dictionary). This is an arbitrary string of bytes
The value of the ID entry is not a string but an array of two strings. And the string values are not arbitrary: they are meant to be unique, and the specification recommends obtaining them by hashing. In particular, they must not be re-used for different documents, which would be acceptable if they were merely arbitrary.
The other article quoted from is also not entirely correct in saying:
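To make the hashing recommendation concrete, here is a minimal sketch of how a writer might derive such an identifier. It assumes the inputs that section 14.4 suggests (current time, file location, file size, and the values of the information dictionary); `make_file_id` and `format_id_entry` are hypothetical names, and this is not how PdfSharp or any other particular library does it.

```python
import hashlib
import time


def make_file_id(location, size_in_bytes, info, now=None):
    """Derive a 16-byte file identifier by MD5-hashing document metadata.

    `now` can be passed explicitly (for example in tests); otherwise the
    current time is used, which is what makes the result differ per run.
    """
    if now is None:
        now = time.time()
    digest = hashlib.md5()
    digest.update(str(now).encode("utf-8"))
    digest.update(location.encode("utf-8"))
    digest.update(str(size_in_bytes).encode("utf-8"))
    for value in info.values():
        digest.update(value.encode("utf-8"))
    return digest.digest()


def format_id_entry(permanent, changing=None):
    """Render the trailer entry; a newly written file repeats the same value."""
    if changing is None:
        changing = permanent
    return "/ID[<%s><%s>]" % (permanent.hex().upper(), changing.hex().upper())
```

Because the current time is one of the inputs, two runs over otherwise identical content produce different identifiers, which is why the ID shows up as noise when comparing generated PDFs.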
a program that makes PDF files is only required to create the file identifier if the file is to be encrypted.
Even when not encrypting, that program has to have good reasons not to create file identifiers, as creating them is a recommendation in the specification. Lacking such reasons, therefore, a program is required to create the file identifier.
This all being said, any PDF consumer always has to be prepared to find a PDF without a file identifier; there might be a reason for not creating it after all.
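For the scrubber in the original question, this means the ID can be treated like the creation date: a per-file value to neutralise before comparing, and one that may be missing altogether. The following is a rough sketch under those assumptions, not the actual ApprovalTests scrubber; `scrub_file_id` is a hypothetical name and only hex-string identifiers are handled.

```python
import re

# Assumption: both identifiers are hex strings, as in the PdfSharp output.
ID_ENTRY = re.compile(
    rb"(/ID\s*\[\s*<)([0-9A-Fa-f]+)(>\s*<)([0-9A-Fa-f]+)(>\s*\])"
)


def scrub_file_id(pdf_bytes):
    """Replace both file identifiers with zeros of the same length.

    The length is preserved so that no byte offsets in the file move.
    A PDF without an /ID entry is returned unchanged.
    """
    def zero_out(match):
        return (
            match.group(1)
            + b"0" * len(match.group(2))
            + match.group(3)
            + b"0" * len(match.group(4))
            + match.group(5)
        )

    return ID_ENTRY.sub(zero_out, pdf_bytes)
```

Note that this is only suitable for unencrypted files: in an encrypted PDF the identifier is an input to the encryption, so altering it would make the file unreadable.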
According to this article:
4. Append the file identifier (the /ID entry from the trailer
dictionary). This is an arbitrary string of bytes; Adobe
recommends that it be generated by MD5 hashing various pieces
of information about the document.
That was talking about the encryption of PDFs. According to this article, the ID is only needed during encryption:
a program that makes PDF files is only required to create the file
identifier if the file is to be encrypted.
This SO link also has some good info. It states that the ID only needs to be reasonably unique, and gives the specific ISO number to find more info.