
How can I determine the extent (in bytes) of page 1 in a linearized PDF file?

Tags: c#, pdf

I know that I can 'linearize' a PDF file, for example using the Acrobat SDK or using commercial tools. This is also called 'optimized for web', and it rearranges the PDF so that page 1 can load as quickly as possible. PDFs served in this way are displayed more quickly, because the PDF viewer doesn't have to wait for the whole PDF to be downloaded.

Update: based on answer below, I now realize that a linearized PDF is not just rearranged, but also contains metadata about its own structure, in the form of the "linearization dictionary".

I have an application where I want to prefetch several PDFs (the results of a query) in anticipation that the user will want to see one of them. It would be awesome if my client could download page 1, and only page 1, for each of the search results. When the user selects one of them, page 1 can be displayed instantly, and the remainder can be downloaded in the background.

I'm looking for a general solution that can be used server-side (Windows or Linux) to preprocess my PDFs, so that I can store and serve page 1 and the remainder separately. Really, all I need to know is where in the PDF is the last byte needed to properly display page 1. If I can have this number, all else follows.

I have browsed the ISO specification for PDF but the file format seems too complex for me to simply parse out where page 1 ends. On the other hand, the tools that linearize PDFs must almost certainly know where page 1 ends.

I am not interested in the complications of serving PDFs in pieces to the clients; this part is already solved since the client is an app, not a browser, and I have full control.

I also don't think it will help me to split the PDF using tools like AP Split into a "page 1" PDF and a complete PDF. If I do, then I will not be able to fool the client viewer into thinking it is a single PDF file, and there will be noticeable flicker when I replace the "page 1" PDF with the full PDF.

Any help or pointers appreciated.

Solution (based on Bobrovsky's answer below):

A properly linearized PDF begins with a header line (defined in section 7.5.2 of the PDF spec) such as "%PDF-1.7" followed by a comment line of at least four binary characters (defined as byte values of 128 or higher). For example:

    %PDF-1.7
    %¤¤¤¤

This header is immediately followed by the linearization dictionary (defined in Appendix F in the PDF spec). An example:

    43 0 obj
    << /Linearized 1.0 % Version
     /L 54567   % File length
     /H [475 598] % Primary hint stream offset and length (part 5)
     /O 45      % Object number of first page’s page object (part 6)
     /E 5437    % Offset of end of first page
     /N 11      % Number of pages in document
     /T 52786 % Offset of first entry in main cross-reference table (part 11)
    >>
    endobj

In this example, the end of the first page is at byte offset 5437. This data structure is simple enough to parse in pretty much any language. The "43 0 obj" line gives the object number of this dictionary (43) and its generation number (always zero for linearized files). The dictionary itself is delimited by << and >>, between which are key-value pairs (keys begin with a slash, like "/E").
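As a minimal sketch of how little parsing those key-value pairs need, the following collects the single-integer entries of the dictionary into a map. The class and method names are made up for illustration, and the approach assumes the dictionary text contains only names, numbers, arrays and "%" comments, as in the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinearizationDictionarySketch
{
    // Collects the single-integer entries (/L, /O, /E, /N, /T) from the text of a
    // linearization dictionary. Array entries such as /H and real numbers such as
    // "/Linearized 1.0" are skipped.
    public static Dictionary<string, long> ReadIntegerEntries(string dictionaryText)
    {
        // Drop "% ..." comments first so that text inside them is not mistaken for entries.
        string withoutComments = Regex.Replace(dictionaryText, @"%[^\r\n]*", "");
        var entries = new Dictionary<string, long>();
        // A key is a slash followed by letters; the value is a run of digits that is
        // not followed by another digit or a decimal point.
        foreach (Match m in Regex.Matches(withoutComments, @"/(?<key>[A-Za-z]+)\s+(?<value>\d+)(?![\d.])"))
        {
            entries[m.Groups["key"].Value] = long.Parse(m.Groups["value"].Value);
        }
        return entries;
    }
}
```

Applied to the example dictionary, `ReadIntegerEntries(text)["E"]` is 5437. Because /H is an array, it would need separate handling if the hint stream location is also of interest.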

And here's a C# method that finds the relevant number using a regex:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;

    public int GetPageOneLength(byte[] data)
    {
      // According to the ISO PDF spec: "The linearization parameter dictionary shall be
      // entirely contained within the first 1024 bytes of the PDF file" (p. 679)
      // Bytes of 128 or higher (e.g. the binary comment on line 2 of the header) decode to '?'.
      string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));
      // \s* allows whitespace after "<<"; [^>] also matches line breaks, so a multi-line dictionary is found.
      var match = Regex.Match(preamble, @"<<\s*/Linearized\b[^>]*?/E\s+(?<offset>\d+)[^>]*>>");
      if (!match.Success) throw new InvalidDataException("PDF does not have a proper linearization dictionary");
      return int.Parse(match.Groups["offset"].Value);
    }

Note Bobrovsky's caveat that a file may contain the linearization dictionary, yet may not be properly linearized (perhaps because of an incremental edit?). In my case, this is not a problem, as I will linearize all the PDFs myself.
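Once the offset is known, storing page 1 and the remainder separately is a plain byte slice. The sketch below is only an illustration: the class and method names are hypothetical, and it assumes the whole file fits in memory and that the offset passed in is the /E value of a properly linearized file.

```csharp
using System;

public static class PdfSplitSketch
{
    // Splits the file bytes at the given offset: bytes [0, pageOneLength) form the
    // first part, everything from pageOneLength onward forms the remainder.
    public static (byte[] PageOne, byte[] Remainder) SplitAtPageOne(byte[] data, int pageOneLength)
    {
        if (pageOneLength < 0 || pageOneLength > data.Length)
            throw new ArgumentOutOfRangeException(nameof(pageOneLength));

        byte[] pageOne = new byte[pageOneLength];
        byte[] remainder = new byte[data.Length - pageOneLength];
        Array.Copy(data, 0, pageOne, 0, pageOne.Length);
        Array.Copy(data, pageOneLength, remainder, 0, remainder.Length);
        return (pageOne, remainder);
    }
}
```

Concatenating the two parts in order reproduces the original file byte for byte, so the client can append the remainder to what it already holds.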

Sten L asked Apr 10 '12 22:04



1 Answer

The linearization dictionary should help you with this.

The dictionary is required to contain the E parameter, which is:

The offset of the end of the first page (the end of part 6 in Example F.1), relative to the beginning of the file.

Please note that not every file with a linearization dictionary is actually linearized (broken generators, changes after linearization, etc.). So you might not be able to use the described approach unless your files are verified to be properly linearized.
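One cheap sanity check, sketched here under assumptions, is to compare the dictionary's /L entry (the declared file length) with the actual length of the file. As far as I read the spec's description of /L, a mismatch means the linearization information should not be trusted, which is what happens when an incremental update is appended. The class and method names are hypothetical, and a matching length does not prove that the rest of the file is properly linearized.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LinearizationCheckSketch
{
    // Returns true when a /L entry is found in the first 1024 bytes and equals the
    // actual length of the data. Returns false if no linearization dictionary is
    // found or the declared length differs.
    public static bool DeclaredLengthMatches(byte[] data)
    {
        // Non-ASCII bytes decode to '?', which does not affect the dictionary text.
        string preamble = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 1024));
        var match = Regex.Match(preamble, @"<<\s*/Linearized\b[^>]*?/L\s+(?<length>\d+)");
        return match.Success
            && long.TryParse(match.Groups["length"].Value, out long declared)
            && declared == data.Length;
    }
}
```

A file that fails this check is best treated as an ordinary, non-linearized PDF and served whole.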

Please have a look at F.2.2 Linearization Parameter Dictionary (Part 2) in the PDF Reference for more information about the linearization dictionary.

Bobrovsky answered Nov 15 '22 04:11