PDF compare on linux command line

I'm looking for a Linux command line tool to compare two PDF files and save the diffs to a PDF output file. The tool should create the diff PDFs in a batch process. The PDF files are construction plans, so a pure text comparison doesn't work.

Something like:

<tool> file1.pdf file2.pdf -o diff-out.pdf 

Most of the tools I found convert the PDFs to images and compare them, but only with a GUI.

Any other solution is also welcome.

asked Jun 24 '11 14:06 by Christof Aenderl


1 Answer

I've written my own script that does something similar to what you're asking for. The script uses 4 tools to achieve its goal:

  1. ImageMagick's compare command
  2. the pdftk utility (if you have multipage PDFs)
  3. Ghostscript (optional)
  4. md5sum (optional)

It should be quite easy to port this to a .bat batch file for DOS/Windows.

But first, please note: this only works well for PDFs which have the same page/media size. The comparison is done pixel by pixel between the two input PDFs. The resulting file is an image showing the "diff" like this:

  • Each pixel that remains unchanged becomes white.
  • Each pixel that got changed is painted in red.

That diff image is saved as a new PDF to make it more accessible on different OS platforms.

I'm using this for example to discover minimal page display differences when font substitution in PDF processing comes into play.

It can happen that there is no visible difference between your PDFs, even though they differ in MD5 hash and/or file size. In that case the "diff" output PDF page is all white. You can detect this condition automatically and delete the all-white PDFs, so that you only have to visually inspect the non-white ones.

Here are the building blocks:

pdftk

Use this command line utility to split multipage PDF files into multiple singlepage PDFs:

pdftk  file_1.pdf  burst  output  somewhere/file_1---page_%03d.pdf
pdftk  file_2.pdf  burst  output  somewhere/file_2---page_%03d.pdf

If you are comparing 1-page PDFs only, this building block is optional. Since you talk about "construction plans", this is likely the case.
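For a batch run, the two burst calls could be wrapped in a small helper. This is only a sketch and not part of the original script: the function names burst_pair and count_pages and the work directory argument are assumptions made for illustration. A differing page count between the two inputs is worth checking before any comparison.

```shell
# Sketch: burst two multipage PDFs into single-page files, using the naming
# scheme shown above. Function names and "workdir" are assumptions.
burst_pair() {
    local pdf1="$1"
    local pdf2="$2"
    local workdir="$3"
    mkdir -p "$workdir" || return 1
    pdftk "$pdf1" burst output "$workdir/file_1---page_%03d.pdf" || return 1
    pdftk "$pdf2" burst output "$workdir/file_2---page_%03d.pdf" || return 1
}

# Count the single-page files created for one input (prefix file_1 or file_2).
count_pages() {
    local workdir="$1"
    local prefix="$2"
    find "$workdir" -maxdepth 1 -name "${prefix}---page_*.pdf" | wc -l | tr -d ' '
}
```

It might be called as burst_pair file_1.pdf file_2.pdf somewhere, followed by a check that count_pages somewhere file_1 and count_pages somewhere file_2 print the same number.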

compare

Use this command line utility from ImageMagick to create a "diff" PDF page for each of the pages:

compare \
       -verbose \
       -debug coder \
       -log "%u %m:%l %e" \
        somewhere/file_1---page_001.pdf \
        somewhere/file_2---page_001.pdf \
       -compose src \
        somewhereelse/file_1--file_2---diff_page_001.pdf
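To cover every page of a burst PDF, the compare call can be put in a loop. The sketch below assumes the file naming used above; the function name diff_all_pages and the two directory arguments are made up for this example, and the debugging options are left out for brevity. Note that compare exits with status 1 when the images differ, so that status must not be treated as a failure.

```shell
# Sketch: run ImageMagick's compare for every page pair in "workdir" and
# write the diff pages to "outdir". Names are assumptions for illustration.
diff_all_pages() {
    local workdir="$1"
    local outdir="$2"
    local page1 page2 suffix status
    mkdir -p "$outdir" || return 1
    for page1 in "$workdir"/file_1---page_*.pdf; do
        [ -e "$page1" ] || continue          # the glob matched nothing
        suffix="${page1##*---page_}"         # e.g. "001.pdf"
        page2="$workdir/file_2---page_$suffix"
        if [ ! -e "$page2" ]; then
            echo "no counterpart for $page1" >&2
            continue
        fi
        status=0
        compare "$page1" "$page2" -compose src \
            "$outdir/file_1--file_2---diff_page_$suffix" || status=$?
        # 0 = similar, 1 = different; only 2 and above is a real error.
        [ "$status" -le 1 ] || return "$status"
    done
}
```

A call such as diff_all_pages somewhere somewhereelse would then produce one diff page per page pair.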

Ghostscript

Because of automatically inserted metadata (such as the current date and time), PDF output does not work well for MD5-hash-based file comparisons.

If you want to automatically discover all cases where the diff PDF consists of a purely white page, you should convert the PDF page to a metadata-free bitmap format using the ppmraw output device. You can do that like this:

First, find out what the page size format of your PDF is. Again, this little utility identify comes as part of any ImageMagick installation:

 identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf

You can store this value in an environment variable like this:

 export my_size=$(identify \
   -format "%[fx:(w)]x%[fx:(h)]" \
    somewhereelse/file_1--file_2---diff_page_001.pdf)

Now Ghostscript comes into play, with a command line that uses the page size stored in that variable:

 gs \
   -o somewhereelse/file_1--file_2---diff_page_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
    somewhereelse/file_1--file_2---diff_page_001.pdf

This gives you a PPM (Portable PixMap) with a resolution of 72 dpi from the original PDF page. 72 dpi usually is good enough for what we want... Next, create a purely white PPM page with the same page size:

 gs \
   -o somewhereelse/file_1--file_2---whitepage_001.ppm \
   -sDEVICE=ppmraw \
   -r72 \
   -g${my_size} \
   -c "showpage"

The -c "showpage" part is a PostScript command that tells Ghostscript to emit an empty page only.
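The identify call and the two Ghostscript calls can be combined into one helper per diff page. This is a sketch under assumptions: the function name render_ppm_pair and the ".white.ppm" suffix are invented here, the -q option is added only to silence Ghostscript's banner, and the input is expected to be a single-page PDF.

```shell
# Sketch: render one diff PDF page and a white page of the same size as PPM.
# Function name and output suffixes are assumptions for illustration.
render_ppm_pair() {
    local diff_pdf="$1"
    local base="${diff_pdf%.pdf}"
    local size
    # Page size as WIDTHxHEIGHT, as reported by identify.
    size=$(identify -format "%[fx:(w)]x%[fx:(h)]" "$diff_pdf") || return 1
    # The diff page itself ...
    gs -q -o "$base.ppm" -sDEVICE=ppmraw -r72 -g"$size" "$diff_pdf" || return 1
    # ... and an empty page of the same size to compare it against.
    gs -q -o "$base.white.ppm" -sDEVICE=ppmraw -r72 -g"$size" -c "showpage" || return 1
}
```

After render_ppm_pair somewhereelse/file_1--file_2---diff_page_001.pdf, both PPM files sit next to the diff PDF.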

MD5 sum

Use the MD5 hash to automatically compare the original PPM with the whitepage PPM. If they are the same, you can safely assume that there are no differences between the PDFs and therefore rename or delete the diff PDF:

 MD5_1=$(md5sum somewhereelse/file_1--file_2---diff_page_001.ppm | awk '{print $1}')
 MD5_2=$(md5sum somewhereelse/file_1--file_2---whitepage_001.ppm | awk '{print $1}')

 if [ "x${MD5_1}" = "x${MD5_2}" ]; then
     mv  \
       somewhereelse/file_1--file_2---diff_page_001.pdf \
       somewhereelse/file_1--file_2---NODIFFERENCE_page_001.pdf # rename all-white PDF
     rm  \
       somewhereelse/file_1--file_2---*_page_001.ppm            # delete both PPMs
 fi

This spares you from having to visually inspect "diff PDFs" that do not have any differences.
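To get back to a single output file as asked in the question, the remaining diff pages could be joined again with pdftk's cat operation. The following sketch is an addition and not part of the original script: mark_if_blank and join_diff_pages are hypothetical helper names, and the paths passed to them are whatever the earlier steps produced.

```shell
# Sketch: rename a diff PDF whose rendered page equals the white page.
# Helper names are assumptions for illustration.
mark_if_blank() {
    local diff_pdf="$1"
    local page_ppm="$2"
    local white_ppm="$3"
    local sum_page sum_white
    sum_page=$(md5sum < "$page_ppm") || return 2
    sum_white=$(md5sum < "$white_ppm") || return 2
    if [ "$sum_page" = "$sum_white" ]; then
        # .../file_1--file_2---diff_page_001.pdf
        #   -> .../file_1--file_2---NODIFFERENCE_page_001.pdf
        mv "$diff_pdf" \
           "${diff_pdf%---diff_page_*}---NODIFFERENCE_page_${diff_pdf##*---diff_page_}"
    fi
}

# Join all diff pages that were not renamed into one PDF, in page order
# (the zero-padded page numbers make the glob sort correctly).
join_diff_pages() {
    local outdir="$1"
    local result="$2"
    set -- "$outdir"/file_1--file_2---diff_page_*.pdf
    if [ ! -e "$1" ]; then
        echo "no diff pages left in $outdir" >&2
        return 1
    fi
    pdftk "$@" cat output "$result"
}
```

For example, join_diff_pages somewhereelse diff-out.pdf would write only the pages that actually changed to diff-out.pdf.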

answered Oct 01 '22 17:10 by Kurt Pfeifle