
How are PDF files able to be partially displayed while downloading?

According to the PDF 1.7 specification, page 90, Sec 3.4:

The preceding sections describe the syntax of individual objects. This section describes how objects are organized in a PDF file for efficient random access and incremental update. A canonical PDF file initially consists of four elements (see Figure 3.2):

  • A one-line header identifying the version of the PDF specification to which the file conforms

  • A body containing the objects that make up the document contained in the file

  • A cross-reference table containing information about the indirect objects in the file

  • A trailer giving the location of the cross-reference table and of certain special objects within the body of the file

In short, the file consists of the header, then the body, then the cross-reference (xref) table, and finally the trailer, which gives the location of the xref table. The key point is that the trailer and the xref table sit at the end of the file, and the xref table holds the essential metadata for the body content (mainly the 10-digit byte offset of each object).
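To make the "xref at the end" point concrete, here is a minimal Python sketch of how a reader of a classic (non-linearized) PDF might locate the xref table: it scans the tail of the file for the `startxref` keyword and reads the byte offset that follows. The helper name `find_startxref`, the 1024-byte tail size, and the hand-made `sample` bytes are assumptions for illustration only; real files can carry several xref sections after incremental updates.

```python
import re


def find_startxref(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    tail = data[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF marker found in the file tail")
    return int(matches[-1])


# Tiny hand-made stand-in for a PDF (hypothetical, not a complete document).
body = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_offset = len(body)  # the xref table starts right after the body
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    + b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)

offset = find_startxref(sample)
print(sample[offset:offset + 4])  # the bytes at that offset are the 'xref' keyword
```

Because the offset can only be read from the last bytes of the file, a reader following this layout has nothing to work with until the download is complete.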

Given that the xref table itself is located at the very end of a PDF file:

  • How is it that my browser (Google Chrome) was able to partially display the PDF file (the first hundred pages or so) before the entire file was finished downloading?

See the screenshot of my partially downloaded PDF file:

[Image: partially downloaded PDF file]

asked Apr 02 '15 at 15:04 by myermian




1 Answer

The kind of PDF file the OP describes is known as "web optimized" (marketing term) or "linearized" (technical term in PDF parlance).

Note that partial display only works if two further conditions are met, on top of the file being linearized:

  1. The PDF viewer needs to be able to handle this type of PDF and take advantage of the linearization feature.
  2. The (remote) host serving the linearized PDFs needs to support "byte streaming" (also called byte serving, that is, HTTP range requests).

If byte streaming is not supported by the server, or if the PDF file is not linearized, the entire file still needs to be downloaded before the viewer can display any page.
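As a rough illustration of the server-side condition: servers that honour HTTP range requests usually advertise this with an `Accept-Ranges: bytes` response header. The Python sketch below checks for that header. The helper names `supports_byte_ranges` and `probe` and the sample header dictionaries are made up for this example, and a missing header does not strictly prove that ranges are unsupported.

```python
import urllib.request


def supports_byte_ranges(headers: dict) -> bool:
    """True if the response headers advertise byte-range support."""
    lowered = {name.lower(): value for name, value in headers.items()}
    units = [u.strip().lower() for u in lowered.get("accept-ranges", "").split(",")]
    return "bytes" in units


def probe(url: str) -> bool:
    """Send a HEAD request and inspect the headers (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return supports_byte_ranges(dict(response.headers.items()))


# Hand-written header sets standing in for real server responses.
with_ranges = {"Content-Type": "application/pdf", "Accept-Ranges": "bytes"}
without_ranges = {"Content-Type": "application/pdf", "Accept-Ranges": "none"}
print(supports_byte_ranges(with_ranges))     # True
print(supports_byte_ranges(without_ranges))  # False
```

`probe` is not called here because it needs a reachable URL; it is included only to show where the headers would come from.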

The description about the PDF file structure quoted by the OP does not apply to linearized PDF files. These are organized in a slightly different way:

  1. Special rules apply to the ordering of PDF objects ("standard" PDFs can have their objects in any arbitrary order).
  2. The PDF document needs to contain some additional structures called "hint tables", which allow efficient navigation within it (even if it is not yet completely downloaded).

Regarding the ordering, a linearized PDF contains its objects in two groups:

  1. In the first group is the document catalogue, all document-level objects, and all objects belonging to the first-to-be-displayed page (not necessarily "page 0"!). The objects shall be numbered sequentially.

  2. The second group holds all the other objects.

These groups shall be indexed by two xref table sections.

  1. The first group's xref section appears immediately after the first indirect object, very close to the beginning of the file.
  2. The second group's xref section is positioned at the end of the file (just as in standard, non-linearized PDFs).

The first object immediately after the %PDF-1.x header line shall contain a dictionary key indicating the /Linearized property of the file.
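A viewer can therefore decide whether a file is linearized from the first kilobyte or so alone. The following Python sketch looks for `/Linearized` in the first object's dictionary and collects its integer entries: per Appendix F these include /L (file length), /O (object number of the first page), /E (end offset of the first page), /N (page count) and /T (offset into the main xref table). The function name `read_linearization_dict`, the 1024-byte window, and the sample bytes are assumptions for illustration; the regex does not handle nested dictionaries and is not a real PDF parser.

```python
import re


def read_linearization_dict(head: bytes):
    """Return the integer entries of the linearization dictionary, or None."""
    match = re.search(rb"\d+\s+\d+\s+obj\s*<<(.*?)>>", head[:1024], re.DOTALL)
    if match is None or b"/Linearized" not in match.group(1):
        return None
    entries = {}
    # Only simple '/Key 123' pairs are collected; arrays such as /H are skipped.
    for key, value in re.findall(rb"/([A-Za-z]+)\s+(\d+)", match.group(1)):
        entries[key.decode("ascii")] = int(value)
    return entries


# Hypothetical file heads, written by hand for this example.
linearized_head = (
    b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
    b"<< /Linearized 1 /L 11000 /H [ 500 120 ] /O 4 /E 3000 /N 10 /T 10800 >>\n"
    b"endobj\n"
)
plain_head = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"

print(read_linearization_dict(linearized_head))
print(read_linearization_dict(plain_head))  # None
```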

This overall structure allows a conforming reader to learn the complete list of object addresses very quickly, without needing to download the complete file from beginning to end:

  • The viewer can display the first page(s) very fast, before the complete file is downloaded.

  • The user can click a thumbnail page preview (or a link in the file's ToC) to jump to, say, page 445 immediately after the first page(s) have been displayed. The viewer then asks the remote server, via byte range requests, to deliver the objects required for page 445 "out of order", so that it can display this page sooner. (While the user reads pages out of order, the download of the complete document continues in the background.)
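The "out of order" fetch in the second bullet comes down to an HTTP `Range` request header and a `206 Partial Content` response carrying a `Content-Range` header. Here is a small Python sketch of both sides; the helper names `range_header` and `parse_content_range` and the offsets used are invented for illustration, since in a real viewer they would come from the hint tables and xref data.

```python
import re
from typing import Optional, Tuple


def range_header(offset: int, length: int) -> dict:
    """Build the request header asking for `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive on both ends.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def parse_content_range(value: str) -> Tuple[int, int, Optional[int]]:
    """Parse 'bytes first-last/total' from a 206 response ('*' means unknown total)."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+|\*)", value.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {value!r}")
    first, last, total = match.groups()
    return int(first), int(last), None if total == "*" else int(total)


# Made-up numbers: 4096 bytes of page objects located 1,000,000 bytes into the file.
print(range_header(1_000_000, 4096))
print(parse_content_range("bytes 1000000-1004095/11000000"))
```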

The technical details of PDF "linearization" can be found in the normative Appendix F of Adobe's original PDF 1.7 Specification (ca. 11 MByte, and itself an example of such a linearized PDF file!).

answered Sep 20 '22 at 14:09 by Kurt Pfeifle