 

PDF compressing library/tool

I am working on a project to reduce the size of PDF files by compressing them. Are there any good tools or libraries (.NET) on the market for this? I tried a few tools, such as Onstream Compression, but the results were not satisfactory.

Sabby62 asked Jan 11 '23 09:01



2 Answers

Some additional (mega-)bytes can easily be squeezed out of PDFs. E.g., is the well-known "PDF32000_2008.pdf" optimized enough? Its file size is 8,995,189 bytes. It uses object and xref streams, has (nearly) no images, and everything is packed tight. Or is it?

Look at a page dictionary:

Dict:9 [1 0 R]
.   /Annots Array:3
.   /Contents Stream:3 [2 0 R]
.   /CropBox Array:4
.   /MediaBox Array:4
.   /Parent Dict:4 [124248 0 R]
.   /Resources Dict:4
.   /Rotate 0 (Number)
.   /StructParents 2 (Number)
.   /Type Page (Name)

/Rotate 0 is the default, so why is it there? What is /CropBox there for? It defaults to /MediaBox, and no page in this document has a CropBox that differs from its MediaBox. Why is /MediaBox there? It's inheritable and all pages are the same size, so move it to the Pages tree root! There are 756 pages, i.e. redundant (or useless) information is replicated 756 times.
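The page-level cleanup described above can be sketched in a few lines of Python. This is only an illustration: pages are modelled as plain dictionaries rather than any real PDF library's objects, the Pages tree is assumed to be flat (no intermediate nodes carrying their own /MediaBox), and the name `slim_pages` and the page size in the sample data are made up.

```python
def slim_pages(pages, pages_root):
    """Drop redundant page keys and hoist a shared MediaBox to the tree root.

    `pages` is a list of dicts standing in for page dictionaries and
    `pages_root` is a dict standing in for the Pages tree root
    (hypothetical model, not a real PDF library API).
    """
    for page in pages:
        # /Rotate 0 is the default, so storing it is redundant.
        if page.get("Rotate") == 0:
            del page["Rotate"]
        # /CropBox defaults to /MediaBox; drop it when the two are equal.
        if "CropBox" in page and page["CropBox"] == page.get("MediaBox"):
            del page["CropBox"]

    # /MediaBox is inheritable: if every page carries the same box,
    # keep a single copy on the root instead of one per page.
    boxes = [page.get("MediaBox") for page in pages]
    if boxes and boxes[0] is not None and all(b == boxes[0] for b in boxes):
        pages_root["MediaBox"] = boxes[0]
        for page in pages:
            del page["MediaBox"]
    return pages, pages_root


# Sample data: 756 identical pages with an assumed page size.
box = [0, 0, 612, 792]
pages = [
    {"Type": "Page", "Rotate": 0, "CropBox": list(box), "MediaBox": list(box)}
    for _ in range(756)
]
root = {"Type": "Pages", "Count": 756}
slim_pages(pages, root)
```

After the call, each page dictionary in the sample has lost three keys and the root holds the only copy of the MediaBox.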

Look at a typical annotation dictionary:

Dict:6 [3548 0 R]
.   /A Dict:2
.   .   /S URI (Name)
.   .   /URI http://www.iso.org/iso/iso_catalogue/... (String)
.   /Border Array:3
.   .   [0] 0 (Number)
.   .   [1] 0 (Number)
.   .   [2] 0 (Number)
.   /Rect Array:4
.   .   [0] 82.14 (Number)
.   .   [1] 576.8 (Number)
.   .   [2] 137.1 (Number)
.   .   [3] 587.18 (Number)
.   /StructParent 3 (Number)
.   /Subtype Link (Name)
.   /Type Annot (Name)

There are thousands (maybe > 10,000?) of link annotations in this document. The /Type key is optional, so why is it there? The links are invisible rectangles; is placement precision finer than a whole number of points really relevant? Round the coordinates to integers.
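A rough sketch of that annotation cleanup, again on plain Python dictionaries rather than a real PDF object model. The function name `slim_link_annotation` is hypothetical, and rounding outwards (so the clickable area never shrinks) is my own assumption; plain rounding to the nearest integer would work just as well for most files.

```python
import math


def slim_link_annotation(annot):
    """Return a copy of a link annotation without /Type and with an integer /Rect."""
    # /Type is optional for annotations, so it can be dropped.
    slim = {key: value for key, value in annot.items() if key != "Type"}

    # /Rect may list any two diagonally opposite corners; normalize first.
    x0, y0, x1, y1 = annot["Rect"]
    llx, urx = min(x0, x1), max(x0, x1)
    lly, ury = min(y0, y1), max(y0, y1)

    # Assumption: round outwards so the link area never gets smaller.
    slim["Rect"] = [math.floor(llx), math.floor(lly), math.ceil(urx), math.ceil(ury)]
    return slim


annot = {
    "Type": "Annot",
    "Subtype": "Link",
    "Border": [0, 0, 0],
    "Rect": [82.14, 576.8, 137.1, 587.18],
    "StructParent": 3,
}
slimmed = slim_link_annotation(annot)
# slimmed["Rect"] is [82, 576, 138, 588]
```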

Look at a fragment of a typical page content stream, a text-showing operator:

[(w)7(ed)-6( b)21(u)1(t shal)-6(l no)-6(t b)-6(e)1( ed)-6(ite)-6(d)1( un)-6(less the typef)23(aces wh)-6(ich )]TJ

Kerning below some threshold is all but invisible. The threshold can be debated; it's like a JPEG quality level, acceptable to some while others disagree. I think a very conservative estimate (i.e. one that retains most of the quality), with an effect invisible to the average reader, is that kerning with an absolute value below 10 may be omitted. (Care must be taken to preserve justification, of course.) (And I haven't even mentioned that there are files out there with fractional kerning given to 3-6 decimal places! But not in this file.)
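As a minimal sketch of the kerning cleanup, assume the TJ array has already been parsed into a Python list of strings and numbers (the parsing itself and the name `drop_small_kerning` are not from any library). Small adjustments are dropped, neighbouring strings are merged, and the sum of the dropped values is returned so a caller could compensate elsewhere to preserve justification.

```python
def drop_small_kerning(tj_items, threshold=10):
    """Remove kerning adjustments with |value| < threshold from a parsed TJ array.

    `tj_items` is a list of str (text) and int/float (kerning) entries.
    Returns (new_items, dropped_total), where dropped_total is the sum of
    the removed adjustments.
    """
    result = []
    dropped_total = 0
    for item in tj_items:
        if isinstance(item, str):
            # Merge with the previous string if the kerning between them was dropped.
            if result and isinstance(result[-1], str):
                result[-1] += item
            else:
                result.append(item)
        elif abs(item) < threshold:
            dropped_total += item
        else:
            result.append(item)
    return result, dropped_total


# The TJ array from the fragment above, parsed by hand.
tj = ["w", 7, "ed", -6, " b", 21, "u", 1, "t shal", -6, "l no", -6, "t b", -6,
      "e", 1, " ed", -6, "ite", -6, "d", 1, " un", -6, "less the typef", 23,
      "aces wh", -6, "ich "]
slim_tj, dropped = drop_small_kerning(tj)
# slim_tj is ["wed b", 21, "ut shall not be edited unless the typef", 23, "aces which "]
```

Written back out, that would be `[(wed b)21(ut shall not be edited unless the typef)23(aces which )]TJ`, which is considerably shorter than the original operator.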

With the optimizations mentioned above, the file size came down to 7,982,478 bytes. That is about one megabyte (1,012,711 bytes, roughly 11%) shaved off. And it's certainly not the limit; there may be other, better hidden, sources of optimization.

user2846289 answered Jan 18 '23 03:01



To add a few more notes to the already good answers: there is a whole range of applications and libraries that will reduce the file size of PDF files. The first question, going along with @Jongware's answer, is whether anything can be done to begin with.

If your PDF files are coming from everywhere (you have no control over the source), gather a sample of files and determine what your requirements for the resulting PDFs are. If you only want to show them on screen, for example, you have the option to resample images to a much lower resolution (be careful, that is not necessarily true any more for mobile use). If the PDFs are all internal you have it easier, because you can inspect them and see where you could save.
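To judge whether resampling is worth it, it helps to compute the effective resolution of an image as it is placed on the page. The sketch below assumes the default PDF unit of 1/72 inch; the function names, the sample image and the 150 dpi target are made up for illustration and are not a recommendation.

```python
def effective_dpi(pixel_width, pixel_height, placed_width_pt, placed_height_pt):
    """Resolution of an image at its placed size (72 PDF points per inch)."""
    return (pixel_width * 72.0 / placed_width_pt,
            pixel_height * 72.0 / placed_height_pt)


def downsampled_size(pixel_width, pixel_height, placed_width_pt, placed_height_pt,
                     target_dpi):
    """Pixel size after scaling so that neither axis exceeds target_dpi."""
    dpi_x, dpi_y = effective_dpi(pixel_width, pixel_height,
                                 placed_width_pt, placed_height_pt)
    # Never upsample: cap the scale factor at 1.
    scale = min(1.0, target_dpi / max(dpi_x, dpi_y))
    return (max(1, round(pixel_width * scale)),
            max(1, round(pixel_height * scale)))


# Hypothetical image: 3000 x 2000 pixels placed at 360 x 240 points (5 x 3.33 in).
dpi = effective_dpi(3000, 2000, 360, 240)            # (600.0, 600.0)
new_size = downsampled_size(3000, 2000, 360, 240, target_dpi=150)  # (750, 500)
```

In this example the image holds sixteen times as many pixels as the assumed target needs, which is the kind of saving worth looking for in the sample files.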

Use Adobe Acrobat's "Space Audit" feature. Adobe seems to find satisfaction in hiding this nice tool and moving it around between versions of Acrobat, but in Acrobat Pro XI it can be found by opening a PDF file and then selecting "File > Save as other > Optimized PDF..." (not "Reduced size PDF" as you would think). In the dialog window that shows up there's an "Audit space usage" button that will bring up an information window showing how much space elements in the PDF are using.

Depending on what you find there, there are multiple things you can do, most are already mentioned but here's an incomplete list:

  • Downsample images.
  • Change color spaces of images from CMYK to RGB. Be cautious about this as it will a) not provide the space savings you might think (because of compression) and b) might actually be counter-productive if you're unlucky (because of indexing and other neat image tricks).
  • Remove document and object level metadata (some sample sets of magazine page files I have contain more metadata than actual content).
  • Remove proprietary application data (Illustrator has a nasty habit of embedding the complete Illustrator document into a PDF file if you're not careful).
  • Compress object streams and XRef tables if you're sure all readers you're using will be able to handle that.
  • Use optimal compression IF your target readers will handle that (JBIG2, JPEG2000...)
  • Optimize the file structure (some bad PDF files don't optimise fonts and other objects and will have multiple copies scattered throughout the file).
  • Subset all fonts in the document.
  • Remove ICC profiles if they're not needed.
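The "optimize the file structure" item in the list above largely comes down to finding objects that are stored more than once. A minimal sketch, assuming the object bodies have already been extracted as raw bytes keyed by object number (the extraction step and the name `find_duplicate_objects` are hypothetical, not part of any particular library):

```python
import hashlib


def find_duplicate_objects(objects):
    """Map each duplicate object number to the first object with identical bytes.

    `objects` is a dict {object_number: raw_bytes}. References to the keys of
    the returned mapping could be rewritten to point at the values, after
    which the duplicates can be dropped.
    """
    first_seen = {}
    remap = {}
    for obj_num in sorted(objects):
        digest = hashlib.sha256(objects[obj_num]).digest()
        if digest in first_seen:
            remap[obj_num] = first_seen[digest]
        else:
            first_seen[digest] = obj_num
    return remap


# Made-up sample: the same font program embedded three times.
objects = {5: b"font-data", 9: b"font-data", 12: b"image", 20: b"font-data"}
remap = find_duplicate_objects(objects)           # {9: 5, 20: 5}
saved = sum(len(objects[n]) for n in remap)       # 18 bytes in this toy example
```

Real files need more care, since two objects can be equivalent without being byte-identical, so this only catches the easy cases.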

If you want to perform these tasks, there are many tools that can help: either libraries that let you implement this yourself, or commercial (and probably other) tools that work through the command line with predefined actions. callas pdfToolbox is one of these tools (I'm connected to this company!), Enfocus PitStop has functionality in this area, and Apago also has functionality here (though I'm not sure off the top of my head whether they have a command-line version).

David van Driessche answered Jan 18 '23 02:01
