Make git ignore the date in PDF files

Tags:

git

First: I am aware of the general comment: Do not track generated files.

Say I want to track generated PDFs and have git ignore the date written into the PDF. That is, I want git to treat two PDFs as identical if the only difference between them is the Date information.

What I tried is a filter that -- in its clean part -- sets the date to some arbitrary value.

(--- comment ----
Basically, the filter does something along these lines:

## dump the pdf metadata to a file and replace the dates
pdftk "$FILENAME" dump_data | sed -e '{N;s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/}' > "$TMPFILE"

## update the pdf metadata
pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"

) --- end comment ----
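
A clean filter of this kind is usually wired up roughly as follows. This is only a minimal sketch in a scratch repository; the filter name pdfdate and the script path are placeholders, not taken from the setup described above.

```shell
#!/bin/bash
# Sketch: register a clean filter for PDFs in a scratch repository.
# "pdfdate" and the script path below are hypothetical names.
repo=$(mktemp -d)
git init -q "$repo"

# Route every *.pdf through the filter named "pdfdate".
echo '*.pdf filter=pdfdate' > "$repo/.gitattributes"

# The clean command runs when a file is staged: git feeds it the
# working-tree content on stdin and stores whatever it writes to stdout.
git -C "$repo" config filter.pdfdate.clean '/path/to/normalize-pdf-date.sh'

# No transformation is wanted on checkout, so smudge passes content through.
git -C "$repo" config filter.pdfdate.smudge cat
```

Because git compares the cleaned content with what is stored in the repository, the clean command has to produce the same bytes every time it sees the same input.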

The filter works (the committed PDF has the date set to my arbitrary value), but I ran into the known problem that files re-checked out from a git repository with a 'clean' filter end up with modified status.

So, my filter is apparently not what I want to do here.

My question is:
1) Can I use a clever filter approach to get git to ignore the date values in the PDF completely? And how?
or
2) What would be the correct approach if not filters?

Andreas asked Apr 17 '13 10:04



1 Answer

Finally solved this with help from the git mailing list. It wasn't a git issue after all, but rather a problem with what my filter expected of pdftk. (Maybe an encoding thing? I did not dig deeper.)

The helpful message on the git mailing list is here: http://permalink.gmane.org/gmane.comp.version-control.git/224797

Basically, the filter script I wrote was not idempotent: applying the clean filter a second time to an already cleaned file would change the file again.
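
One way to check for this, as a rough sketch: run the filter twice and compare the two results byte for byte. The helper name is_idempotent is made up for illustration, and the filter is assumed to read stdin and write stdout, as a git clean filter does.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) if running the filter
# command "$1" twice on the file "$2" yields the same bytes as running
# it once.
is_idempotent() {
    local filter="$1" input="$2"
    local once twice status
    once=$(mktemp)
    twice=$(mktemp)

    "$filter" < "$input" > "$once"    # first pass: raw file -> cleaned
    "$filter" < "$once"  > "$twice"   # second pass: cleaned -> cleaned again

    cmp -s "$once" "$twice"           # silent byte-for-byte comparison
    status=$?

    rm -f "$once" "$twice"
    return $status
}
```

Called as, say, `is_idempotent ./my-clean-filter.sh some.pdf` (both names are placeholders), a non-zero status means a second pass still alters the output, which is the situation in which git keeps reporting a freshly checked-out file as modified.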

Background: When pdftk is used to update the metadata of a PDF with the metadata it extracted from that exact PDF before, to my surprise it changes the PDF file.

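So I added a safety check to my filter: if replacing the dates leaves the dumped metadata unchanged, the file is passed through untouched. With that, the issue has gone away.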

For reference, here is the full filter:

 #!/bin/bash

 ## use GNU coreutils on OS X explicitly
 ## (install via homebrew, for instance:
 ##  > brew install coreutils
 ##  > brew install gnu-sed
 ## )
 if [ "${OSTYPE:0:6}" == "darwin" ]; then
     MKTMP=gmktemp
     SED=gsed
 else
     MKTMP=mktemp
     SED=sed
 fi


 FILEASARG=true
 if [ "$#" == 0 ]; then
     FILEASARG=false
 fi

 if $FILEASARG ; then
     FILENAME="$1"
 else
     FILENAME=$($MKTMP)
     cat /dev/stdin > "${FILENAME}"
 fi

 TMPFILE=$($MKTMP)
 TMPFILE2=$($MKTMP)
 TMPFILE3=$($MKTMP)

 ## dump the pdf metadata to a file and replace the dates
 pdftk "$FILENAME" dump_data > "$TMPFILE3"
 $SED -e '/Date/{ N; s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/ }' < "$TMPFILE3" > "$TMPFILE"

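 ## if the metadata did not change, pass the file through untouched
 ## (cmp -s is silent, so nothing but the PDF reaches stdout)
 if cmp -s "$TMPFILE3" "$TMPFILE"; then
     rm -f "$TMPFILE" "$TMPFILE2" "$TMPFILE3"
     ## git reads the filter result from stdout
     cat "$FILENAME"
     ## in stdin mode "$FILENAME" is only a temporary copy
     $FILEASARG || rm -f "$FILENAME"
     exit 0
 fi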

 ## update the pdf metadata
 pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"

 ## overwrite the original pdf
 mv -f "$TMPFILE2" "$FILENAME"

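 ## clean up ("$TMPFILE2" has already been moved onto "$FILENAME")
 rm -f "$TMPFILE" "$TMPFILE3"

 ## git reads the filter result from stdout
 cat "$FILENAME"
 ## in stdin mode "$FILENAME" is only a temporary copy
 $FILEASARG || rm -f "$FILENAME"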
Andreas answered Sep 22 '22 00:09
