I have to automate a preflight check on PDF documents. The preflight consists of:
I'm wondering whether this can be done using PoDoFo or any other open source project, or whether I really need to order proprietary software costing $2K to $6K. My hosting environment is Linux and supports PHP, Perl, Python, Ruby, and Java.
Any ideas?
I'm not aware of any ready-made Open Source software which meets your requirements.
Only a part of it could be solved by writing your own shell script (or other program).
Detect resolution of images.
Run pdfimages -list some.pdf
to output a list of the images contained in the PDF along with their dimensions... seemingly. What is not obvious is that these are the dimensions of the raw image as embedded in the PDF. Say that is 720x720 pixels. If the image is rendered onto a 10x10 inch square of the page, it will be 72 DPI on the page. If rendered onto a 1x1 inch square, it will be 720 DPI. Both kinds of 'rendering' can be made from the same embedded raw image, and it is the current 'graphics state' that determines which one applies. So determining the actual DPI of an image as it appears on the page requires some additional PDF parsing...
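To make the arithmetic above concrete, here is a minimal Python sketch. The function names are made up for illustration, and it assumes you have already obtained the placed size of the image from the page content (that is the "additional PDF parsing" part, which is not shown here). The only PDF fact it relies on is that default user space has 72 units per inch.

```python
def points_to_inches(points: float) -> float:
    """Convert PDF default user-space units (1/72 inch) to inches."""
    return points / 72.0


def effective_dpi(pixels: int, placed_inches: float) -> float:
    """Resolution of an image along one axis as it appears on the page.

    pixels        -- raw pixel count of the embedded image on that axis
    placed_inches -- size the image occupies on the page on that axis
    """
    if placed_inches <= 0:
        raise ValueError("placed size must be positive")
    return pixels / placed_inches


# The example from the text: one 720-pixel-wide raw image, two placements.
print(effective_dpi(720, 10))  # 72.0
print(effective_dpi(720, 1))   # 720.0

# Same image placed 144 points (2 inches) wide.
print(effective_dpi(720, points_to_inches(144)))  # 360.0
```

You would run this once per axis, since an image can be scaled differently horizontally and vertically.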
In any case, you can tell Ghostscript to re-sample images to 300 dpi, and to use a 'threshold' for this. (Ghostscript will never "upsample" an image; it only downsamples those that exceed the threshold. Upsampling almost never makes sense: it only blows up the file size with no gain in quality.)
Convert colors to colorspace CMYK using ICC profiles.
The most recent versions of Ghostscript can do that. See also the most recent Ghostscript documentation describing its support for ICC.
Embed un-embedded fonts.
Running (and evaluating the results of) pdffonts some.pdf
will show you which fonts are not embedded.
Ghostscript can embed un-embedded fonts.
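If you want to evaluate the pdffonts output in a script, something along these lines might do. This is only a sketch: the sample text is made up to resemble typical pdffonts output, and the parsing assumes the last five columns are emb, sub, uni and the two-part object ID, and that font names contain no spaces. The helper names are mine, not part of any library.

```python
import subprocess


def unembedded_fonts(pdffonts_output: str) -> list[str]:
    """Return names of fonts whose 'emb' column says 'no'."""
    missing = []
    for line in pdffonts_output.splitlines():
        fields = line.split()
        # Count from the right: ..., emb, sub, uni, object number, generation.
        # The header and the dashed separator fail the yes/no check and are skipped.
        if len(fields) < 6 or fields[-5] not in ("yes", "no"):
            continue
        if fields[-5] == "no":
            missing.append(fields[0])
    return missing


def run_pdffonts(pdf_path: str) -> str:
    """Run pdffonts (from poppler or xpdf) and return its text output."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up sample output, for illustration only.
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Minion-Regular                Type 1C           WinAnsi          yes yes no       8  0
Helvetica                            Type 1            WinAnsi          no  no  no      12  0
Arial-BoldMT                         TrueType          WinAnsi          no  no  no      15  0
"""

print(unembedded_fonts(SAMPLE))  # ['Helvetica', 'Arial-BoldMT']
```

On a real file you would call unembedded_fonts(run_pdffonts("some.pdf")) and fail the preflight if the list is not empty.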
So one Ghostscript command that would cover most of your requirements is this:
gs \
-o cmyk.pdf \
-sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sProcessColorModel=DeviceCMYK \
-sOutputICCProfile=/path/to/your.icc \
-dDownsampleColorImages=true \
-dColorImageDownsampleThreshold=2 \
-dColorImageDownsampleType=/Bicubic \
-dColorImageResolution=300 \
-dDownsampleGrayImages=true \
-dGrayImageDownsampleThreshold=2 \
-dGrayImageDownsampleType=/Bicubic \
-dGrayImageResolution=300 \
-dDownsampleMonoImages=true \
-dMonoImageDownsampleThreshold=2 \
-dMonoImageDownsampleType=/Bicubic \
-dMonoImageResolution=1200 \
-dSubsetFonts=true \
-dEmbedAllFonts=true \
-dCannotEmbedFontPolicy=/Error \
-c "<</NeverEmbed [ ]>> setdistillerparams" \
-f some.pdf
This command downsamples every image whose resolution is more than twice the target resolution (*ImageDownsampleThreshold=2). It also applies all of these settings to any input file, unlike dedicated PDF preflighting software, which applies selective 'fixups' based on the results of 'checks' for specific properties.
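The threshold rule can be written down in a couple of lines. This is just my reading of the behaviour described above, not Ghostscript's actual code, and whether an image sitting exactly on the limit gets downsampled is an assumption on my part (the sketch treats it as "no"). The function name is hypothetical.

```python
def will_downsample(image_dpi: float, target_dpi: float = 300.0,
                    threshold: float = 2.0) -> bool:
    """True if an image at image_dpi would be reduced to target_dpi.

    Images are only touched when they exceed target_dpi * threshold.
    Anything at or below that limit, including low-resolution images,
    is left alone: there is no upsampling.
    """
    return image_dpi > target_dpi * threshold


print(will_downsample(720))   # True:  720 > 600, reduced to 300
print(will_downsample(450))   # False: above 300 but under the 600 limit
print(will_downsample(150))   # False: low-res images are never upsampled
print(will_downsample(2000, target_dpi=1200))  # False: mono limit is 2400
```

Note the middle case: with a threshold of 2, a 450 dpi image passes through unchanged even though it is above the 300 dpi target.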
Lastly, I cannot see what made you think you'd have to spend $2k to $6k if you had to resort to closed-source, commercial preflighting software. (My favorite in this field is the very powerful callas pdfToolbox6, which even has a version that runs as a CLI on Linux. Its basic version costs 500 €.)
My background is in printing, so please keep this in mind when reading my answer. The items you propose seem fairly straightforward, but when you get into the nitty-gritty, there's a lot of print-industry knowledge that goes into these operations.
Here's some quick feedback to your bullet points:
You won't want to upsample a low-res image to 300 dpi, as it will decrease image quality (via re-interpolation) and increase file size.
You need to be careful with color conversions. There may be certain builds of RGB which you'd want to convert to black only. Or what happens if someone supplies a file which is already CMYK and tagged with the incorrect profile?
Font detection - it is very complicated to substitute fonts. If you don't have the exact same font as the originator, you could end up with text reflow problems. To own that font, you'll have to pay for a license. You also can't convert fonts to outlines unless they are embedded.
My recommendation is to look at a commercial package for preflighting. These developers have invested years into developing their programs and are experts in the field of printing. The challenging part will be finding ones that are Unix-based and in your price range. Most are designed for Windows or Mac. Callas has a Linux CLI version, but not at the price listed. You'd need the server version.
What type of volume are you planning to run through it?
Did you try Enfocus PitStop Pro? Contact their support department with your specific request. They have tons of PDF preflight examples and will be happy to help you out.