PDF compare on linux command line

I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF outfile. The tool should create diff PDFs in a batch process. The PDF files are construction plans, so a pure text compare doesn't work.

Something like:

<tool> file1.pdf file2.pdf -o diff-out.pdf

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.

381

asked Jun 24 '11 14:06

Christof Aenderl

1 Answer

I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

ImageMagick's compare command
the pdftk utility (if you have multipage PDFs)
Ghostscript (optional)
md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

Each pixel that remains unchanged becomes white.
Each pixel that got changed is painted in red.

That diff image is saved as a new PDF to make it more accessible on different OS platforms.

I'm using this for example to discover minimal page display differences when font substitution in PDF processing comes into play.

It can happen that there is no visible difference between your PDFs even though they differ in MD5 hash and/or file size. In this case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white PDFs, so that you only have to visually inspect the non-white ones.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple singlepage PDFs:

    pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
    pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
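For multipage inputs, the burst pages then have to be matched up page by page. Here is a minimal sketch of that pairing, assuming the file naming used in the pdftk commands above. The helper name list_page_pairs is made up for illustration and is not part of pdftk. It prints one pair per line, so paths containing spaces would need a different separator.

```shell
# Sketch: pair up the single-page PDFs produced by "pdftk ... burst".
# Assumes the naming scheme <dir>/<name>---page_NNN.pdf used above.
# list_page_pairs is a hypothetical helper, not a pdftk feature.
list_page_pairs() {
    local dir="$1" name1="$2" name2="$3"
    local page1 page2 suffix
    for page1 in "$dir/$name1"---page_*.pdf; do
        [ -e "$page1" ] || continue        # the glob matched nothing
        suffix=${page1#"$dir/$name1"}      # e.g. ---page_001.pdf
        page2="$dir/$name2$suffix"
        if [ -e "$page2" ]; then
            printf '%s %s\n' "$page1" "$page2"
        else
            # Page exists only in the first PDF: report it on stderr.
            printf 'no counterpart for %s\n' "$page1" >&2
        fi
    done
}
```

Called as list_page_pairs somewhere file_1 file_2, it lists every page of file_1 that has a counterpart in file_2 and warns about the ones that do not.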

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

    compare \
           -verbose \
           -debug coder \
           -log "%u %m:%l %e" \
            somewhere/file_1---page_001.pdf \
            somewhere/file_2---page_001.pdf \
           -compose src \
            somewhereelse/file_1--file_2---diff_page_001.pdf
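For batch use, the output name can be derived from the two input names instead of being typed by hand. This is only a sketch under the naming scheme shown above. diff_name and diff_page are hypothetical helper names, and the debug/log options are left out for brevity. As far as I know, compare exits with status 1 when the images differ, so the sketch does not treat a non-zero status as an error on its own.

```shell
# Sketch: build the diff file name from two single-page inputs.
# diff_name and diff_page are hypothetical helpers for illustration.
diff_name() {
    local base1 base2 name1 name2 page
    base1=$(basename "$1" .pdf)      # file_1---page_001
    base2=$(basename "$2" .pdf)      # file_2---page_001
    name1=${base1%---page_*}         # file_1
    name2=${base2%---page_*}         # file_2
    page=${base1##*---page_}         # 001
    printf '%s/%s--%s---diff_page_%s.pdf\n' "$3" "$name1" "$name2" "$page"
}

# diff_page PAGE1 PAGE2 OUTDIR: run compare and print the diff file name.
diff_page() {
    local out
    out=$(diff_name "$1" "$2" "$3")
    mkdir -p "$3"
    # Assumption: a non-zero exit can simply mean "pages differ",
    # so success is judged by whether a non-empty output file exists.
    compare "$1" "$2" -compose src "$out" || true
    [ -s "$out" ] && printf '%s\n' "$out"
}
```

For example, diff_page somewhere/file_1---page_001.pdf somewhere/file_2---page_001.pdf somewhereelse writes the same output file as the command above.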

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using the ppmraw output device. You can do that like this:

First, find out what the page size of your PDF is. Again, the little utility identify comes as part of any ImageMagick installation:

    identify \
       -format "%[fx:(w)]x%[fx:(h)]" \
        somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

    export my_size=$(identify \
       -format "%[fx:(w)]x%[fx:(h)]" \
        somewhereelse/file_1--file_2---diff_page_001.pdf)

Now Ghostscript comes into play, using a commandline which includes the above discovered page size as it is stored in the variable:

    gs \
       -o somewhereelse/file_1--file_2---diff_page_001.ppm \
       -sDEVICE=ppmraw \
       -r72 \
       -g${my_size} \
        somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

    gs \
       -o somewhereelse/file_1--file_2---whitepage_001.ppm \
       -sDEVICE=ppmraw \
       -r72 \
       -g${my_size} \
       -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.
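The two Ghostscript calls can be wrapped into one step per diff page. The following is a rough sketch, not a tested production script. valid_size and render_diff_and_white are hypothetical names. The size check is my own addition: it guards against identify returning something other than a single WIDTHxHEIGHT value, which I would expect to happen, for instance, if a multipage PDF slips through.

```shell
# Sketch: render a diff PDF and a matching white page to PPM.
# valid_size and render_diff_and_white are hypothetical helpers.

# Succeeds only for a single WIDTHxHEIGHT value such as 595x842.
valid_size() {
    printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*x[1-9][0-9]*$'
}

# render_diff_and_white DIFF_PDF
# Expects a name like .../file_1--file_2---diff_page_001.pdf
render_diff_and_white() {
    local pdf="$1" stem size white
    stem=${pdf%.pdf}
    white="${pdf%---diff_page_*}---whitepage_${stem##*---diff_page_}.ppm"
    size=$(identify -format "%[fx:(w)]x%[fx:(h)]" "$pdf") || return 1
    if ! valid_size "$size"; then
        printf 'unexpected page size "%s" for %s\n' "$size" "$pdf" >&2
        return 1
    fi
    # Same options as in the commands above, plus -q to keep gs quiet.
    gs -q -o "$stem.ppm" -sDEVICE=ppmraw -r72 -g"$size" "$pdf" &&
    gs -q -o "$white"    -sDEVICE=ppmraw -r72 -g"$size" -c "showpage"
}
```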

MD5 sum

Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:

    MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
    MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')

    if [ "x${MD5_1}" = "x${MD5_2}" ]; then
        mv  \
          somewhereelse/file_1--file_2---diff_page_001.pdf \
          somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
        rm  \
          somewhereelse/file_1--file_2---*page_001.ppm             # delete both PPMs
    fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
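If you run this over many pages, the check is easier to reuse as a function. The sketch below swaps MD5 for cmp -s, which compares the two PPMs byte by byte and avoids the awk step; the outcome should be the same for this purpose. mark_if_blank is a hypothetical name, and the renaming follows the NODIFFERENCE scheme shown above.

```shell
# Sketch: rename a diff PDF whose rendered page equals the white page.
# mark_if_blank is a hypothetical helper. It uses cmp -s (byte-wise
# comparison) in place of the md5sum comparison shown above.
# Usage: mark_if_blank DIFF_PDF DIFF_PPM WHITE_PPM
mark_if_blank() {
    local diff_pdf="$1" diff_ppm="$2" white_ppm="$3"
    local prefix page
    if cmp -s "$diff_ppm" "$white_ppm"; then
        prefix=${diff_pdf%---diff_page_*}    # .../file_1--file_2
        page=${diff_pdf##*---diff_page_}     # 001.pdf
        mv "$diff_pdf" "${prefix}---NODIFFERENCE_page_${page}"
        rm -f "$diff_ppm" "$white_ppm"       # delete both PPMs
        return 0
    fi
    return 1                                 # real differences: keep all files
}
```

The function returns 0 when the page was blank and got renamed, so a batch loop can count or skip those pages.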

177

answered Oct 01 '22 17:10

Kurt Pfeifle

Tags: linux, comparison, pdf, ghostscript