How can I determine the extent (in bytes) of page 1 in a linearized PDF file?

I know that I can 'linearize' a PDF file, for example using the Acrobat SDK or using commercial tools. This is also called 'optimized for web', and it rearranges the PDF so that page 1 can load as quickly as possible. PDFs served in this way are displayed more quickly, because the PDF viewer doesn't have to wait for the whole PDF to be downloaded.

Update: based on the answer below, I now realize that a linearized PDF is not just rearranged, but also contains metadata about its own structure, in the form of the "linearization dictionary".

I have an application where I want to prefetch several PDFs (the results of a query) in anticipation that the user will want to see one of them. It would be awesome if my client could download page 1, and only page 1, for each of the search results. When the user selects one of them, page 1 can be displayed instantly, and the remainder can be downloaded in the background.

I'm looking for a general solution that can be used server-side (Windows or Linux) to preprocess my PDFs, so that I can store and serve page 1 and the remainder separately. Really, all I need to know is where in the PDF is the last byte needed to properly display page 1. If I can have this number, all else follows.
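Once that offset is known, the split itself is simple. Here is a minimal sketch, assuming the offset has already been determined (for example from the /E entry discussed in the solution below). The class and method names `PdfSplitter` and `SplitAtOffset` are made up for illustration, and the offset is treated as the number of bytes that belong to the "page 1" part.

```csharp
using System;

public static class PdfSplitter
{
    // Splits the file into the bytes before the given offset and the bytes from it onward.
    // endOfFirstPage is assumed to be the offset of the end of the first page.
    public static (byte[] PageOne, byte[] Remainder) SplitAtOffset(byte[] data, int endOfFirstPage)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (endOfFirstPage < 0 || endOfFirstPage > data.Length)
            throw new ArgumentOutOfRangeException(nameof(endOfFirstPage));

        byte[] pageOne = new byte[endOfFirstPage];
        byte[] remainder = new byte[data.Length - endOfFirstPage];
        Array.Copy(data, 0, pageOne, 0, pageOne.Length);
        Array.Copy(data, endOfFirstPage, remainder, 0, remainder.Length);
        return (pageOne, remainder);
    }
}
```

Concatenating the two parts on the client gives back the original file byte for byte, so the viewer still sees a single PDF.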

I have browsed the ISO specification for PDF but the file format seems too complex for me to simply parse out where page 1 ends. On the other hand, the tools that linearize PDFs must almost certainly know where page 1 ends.

I am not interested in the complications of serving PDFs in pieces to the clients; this part is already solved since the client is an app, not a browser, and I have full control.

I also don't think it will help me to split the PDF using tools like AP Split into a "page 1" PDF and a complete PDF. If I do, then I will not be able to fool the client viewer into thinking it is a single PDF file, and there will be noticeable flicker when I replace the "page 1" PDF with the full PDF.

Any help or pointers appreciated.

Solution (based on Bobrovsky's answer below):

A properly linearized PDF begins with a header line (defined in section 7.5.2 of the PDF spec) such as "%PDF-1.7" followed by a comment line of at least four binary characters (defined as byte values of 128 or higher). For example:

    %PDF-1.7
    %¤¤¤¤
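A quick sanity check for this two-line header could look like the sketch below. It only tests the shape described above (a "%PDF-" line followed by a comment line with at least four bytes of value 128 or higher). The names `PdfHeader` and `LooksLikeBinaryPdfHeader` are hypothetical, and the check says nothing about whether the rest of the file is valid.

```csharp
using System;
using System.Text;

public static class PdfHeader
{
    // Returns true if data starts with "%PDF-" and the next line is a comment
    // containing at least four bytes with values of 128 or higher.
    public static bool LooksLikeBinaryPdfHeader(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i]) return false;

        // Skip the rest of the first line, then its end-of-line marker (CR, LF, or CR LF).
        int pos = magic.Length;
        while (pos < data.Length && data[pos] != (byte)'\r' && data[pos] != (byte)'\n') pos++;
        while (pos < data.Length && (data[pos] == (byte)'\r' || data[pos] == (byte)'\n')) pos++;

        // The second line must be a comment, i.e. start with '%'.
        if (pos >= data.Length || data[pos] != (byte)'%') return false;
        pos++;

        // Count the bytes >= 128 up to the end of that line.
        int highBytes = 0;
        while (pos < data.Length && data[pos] != (byte)'\r' && data[pos] != (byte)'\n')
        {
            if (data[pos] >= 128) highBytes++;
            pos++;
        }
        return highBytes >= 4;
    }
}
```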

This header is immediately followed by the linearization dictionary (defined in Appendix F in the PDF spec). An example:

    43 0 obj
    << /Linearized 1.0 % Version
     /L 54567   % File length
     /H [475 598] % Primary hint stream offset and length (part 5)
     /O 45      % Object number of first page’s page object (part 6)
     /E 5437    % Offset of end of first page
     /N 11      % Number of pages in document
     /T 52786 % Offset of first entry in main cross-reference table (part 11)
    >>
    endobj

In this example, the end of the first page is at byte offset 5437. This data structure is simple enough to parse in pretty much any language. The "43 0 obj" line gives the object number of this dictionary (43) and a generation number (always zero for linearized files). The dictionary itself is delimited by << and >>, between which are key/value pairs. Keys are names that start with a slash, like "/E", and everything from a "%" to the end of the line is a comment.

And here's a C# method that finds the relevant number using a regex:

    // Requires: using System; using System.IO; using System.Text; using System.Text.RegularExpressions;
    public int GetPageOneLength(byte[] data)
    {
      // According to the ISO PDF spec: "The linearization parameter dictionary shall be entirely contained within the first 1024 bytes of the PDF file" (p. 679)
      // Note that the binary bytes on line 2 of the header will all be converted to question marks ('?')
      string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));
      // \s* allows whitespace after "<<"; Singleline lets '.' match the line breaks inside a multi-line dictionary
      var match = Regex.Match(preamble, @"<<\s*/Linearized.+?/E\s+(?<offset>\d+).*?>>", RegexOptions.Singleline);
      if (!match.Success) throw new InvalidDataException("PDF does not have a proper linearization dictionary");
      return int.Parse(match.Groups["offset"].Value);
    }
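The same idea extends to the other entries of the dictionary. The following is only a sketch, not a real PDF parser: it strips comments, locates the dictionary that contains /Linearized within the first 1024 bytes, and collects the entries whose value is a plain integer (array values such as /H and real numbers such as "1.0" are skipped). The names `LinearizationDictionary` and `ParseIntegerEntries` are made up for illustration, and the sketch assumes the dictionary contains no nested dictionaries or strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class LinearizationDictionary
{
    // Returns the integer-valued entries (e.g. L, O, E, N, T) of the linearization
    // dictionary, or an empty dictionary if none is found in the first 1024 bytes.
    public static Dictionary<string, long> ParseIntegerEntries(byte[] data)
    {
        string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));

        // Remove comments: everything from '%' to the end of the line.
        string withoutComments = Regex.Replace(preamble, @"%[^\r\n]*", "");

        // Find the "<< ... >>" block that contains the /Linearized key.
        Match dict = Regex.Match(withoutComments, @"<<(?<body>[^<>]*/Linearized[^<>]*)>>");
        var entries = new Dictionary<string, long>();
        if (!dict.Success) return entries;

        // Collect "/Key 123" pairs; the lookahead rejects reals such as "1.0".
        foreach (Match m in Regex.Matches(dict.Groups["body"].Value,
                     @"/(?<key>[A-Za-z]+)\s+(?<value>\d+)(?![\d.])"))
        {
            entries[m.Groups["key"].Value] = long.Parse(m.Groups["value"].Value);
        }
        return entries;
    }
}
```

With the example dictionary above, `ParseIntegerEntries(data)["E"]` would be 5437 and `["L"]` would be 54567.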

Note Bobrovsky's caveat that a file may contain the linearization dictionary, yet may not be properly linearized (perhaps because of an incremental edit?). In my case, this is not a problem, as I will linearize all the PDFs myself.

393

asked Apr 10 '12 22:04

Sten L

1 Answer

Linearization dictionary should help you with this.

The dictionary is required to contain the E parameter, which is defined as:

The offset of the end of the first page (the end of part 6 in Example F.1), relative to the beginning of the file.

Please note that not every file with a linearization dictionary is actually linearized (broken generators, changes made after linearization, etc.). So you might not be able to use the described approach unless your files have been verified to be properly linearized.
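One inexpensive check, sketched below, is to compare the /L entry (the file length recorded in the linearization dictionary) with the actual length of the file. A mismatch typically means the file was modified after linearization, for example by an incremental update. This is a necessary condition only, not proof of correct linearization. The names `LinearizationCheck` and `LengthMatches` are hypothetical, and the regex assumes that /L appears after /Linearized within the first 1024 bytes.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinearizationCheck
{
    // Returns true if a /L entry is found after /Linearized in the first
    // 1024 bytes and its value equals the actual length of the file.
    public static bool LengthMatches(byte[] data)
    {
        string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));
        Match match = Regex.Match(preamble,
            @"/Linearized\b.*?/L\s+(?<length>\d+)", RegexOptions.Singleline);
        return match.Success
            && long.Parse(match.Groups["length"].Value) == data.LongLength;
    }
}
```

A file that fails this check should be treated as an ordinary, non-linearized PDF.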

Please have a look at F.2.2 Linearization Parameter Dictionary (Part 2) in the PDF Reference for more information about the linearization dictionary.

143

answered Nov 15 '22 04:11

Bobrovsky
