Background
The idea is this:
- Person provides contact information for online book purchase
- Book, as a PDF, is marked with a unique hash
- Person downloads book
- PDF passwords alone are not enough: they are easy to circumvent or share
The ideal process would be something like:
- Generate hash based on contact information
- Store contact information and hash in database
- Acquire book lock
- Update an "include" file with hash text
- Generate book as PDF (using `pdflatex`)
- Apply hash to book
- Release book lock
- Send email with book download link
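The hash, lock, and include-file steps listed above could be sketched in Java roughly as follows. This is only a minimal sketch under assumptions that are not in the question: it swaps a plain hash for a keyed hash (HMAC-SHA256) so that a buyer cannot recompute someone else's mark from public contact details, and the server secret, the lock and include file paths, and the `\buyermark` macro name are all hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class BookMark {

    // Keyed hash (HMAC-SHA256) of the buyer's contact details, hex-encoded.
    // serverSecret is an assumed secret that never leaves the server.
    static String markFor(String contactInfo, byte[] serverSecret)
            throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(serverSecret, "HmacSHA256"));
        byte[] digest = mac.doFinal(contactInfo.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Holds an exclusive lock on lockFile while the include file is rewritten.
    // The include file defines a hypothetical \buyermark macro for the LaTeX source.
    static void writeInclude(Path lockFile, Path includeFile, String mark)
            throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            String tex = "\\newcommand{\\buyermark}{" + mark + "}\n";
            Files.write(includeFile, tex.getBytes(StandardCharsets.UTF_8));
            // pdflatex would be run here, while the lock is still held.
        }
    }
}
```

A caller would compute `markFor(contactInfo, secret)`, store the result with the contact information, and then call `writeInclude(...)` before building the PDF. The file lock serializes builds across processes on the same host; requests handled by threads inside a single JVM would need an additional in-process lock.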
Technologies
The following technologies can be used (other programming languages are possible, but libraries will likely be limited to those supplied by the host):
- C, Java, PHP
- LaTeX files
- PDF files
- Linux
Question
What programming techniques (or open source software) should I investigate to:
- Embed a unique hash (or other mark) in a PDF
- Create a collusion-attack resistant mark
- Develop a non-fragile (e.g., `PDF -> EPS -> PDF` still contains the mark) solution
Research
I have looked at the following possibilities:
- Steganography
- Natural Language Processing (NLP)
- Convert blank pages in PDF to images; mark those images; reassemble PDF
- LaTeX watermark package
- ImageMagick
Issues
The possible solutions I have researched have the following issues:
- Steganography. (a) Requires a master copy of the images, which are converted to EPS, which is CPU-intensive and time-consuming; (b) it is unclear whether the watermark would survive `PDF -> EPS -> PDF` or other types of conversion; (c) most images are drawings or screen captures, not photographs in PNG format.
- LaTeX. Creates an image cache; any steganographic solution would have to intercept that process somehow.
- NLP. Introduces grammatical errors; could change the meaning of technical words.
- Blank Pages. Immediately suspect; it is easy to replace suspicious blank pages.
- Watermark Package. Draws visible marks.
- ImageMagick. Draws visible marks.
What other solutions are possible?
Related Links
- http://www.tcpdf.org/
- invisible watermarks in images
Thank you!
I've done this for another project with PDFlib. We needed traceability for the generated PDFs in case the file was leaked. Basically:
- Created a source template PDF with the content in place, and set the document master password with the required options (no edit, no print, no screen reader, etc.)
- At runtime, we applied a few watermarks (imposed a page footer saying "This document checked out to user #12345", set a few of the metadata fields with user ID, download IP, and download date/time, added a "this document copyright by..." cover page, etc.)
- Optionally, attached a user password to force a password prompt when the document is opened.
Since the latest PDF versions use AES-128 for their encryption, we just set a suitable randomly generated, high-entropy, 128-character password. No one would ever be typing it in by hand, so being hard to type was irrelevant to us, and actually preferable. The master password prevented end users from making any changes to the document. The various no-print/no-screen-reader options are only enforced by the PDF reader and are therefore bypassable, but it can't hurt to set them anyway.
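Generating such a password needs nothing beyond the standard library. The following is a rough Java sketch, not the code we used: the class name, method name, and alphabet are illustrative assumptions, and handing the result to a PDF library as the master password is left out.

```java
import java.security.SecureRandom;

final class PdfPasswords {
    // Alphabet is an arbitrary choice for this sketch: 62 alphanumeric characters.
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

    // SecureRandom draws from a cryptographically strong source,
    // unlike java.util.Random.
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a random password of the requested length.
    static String randomPassword(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

Calling `PdfPasswords.randomPassword(128)` gives a 128-character value. The length is a parameter because, as far as I know, the standard security handler used with AES-128 only considers the first 32 bytes of a password, so it is worth checking what the chosen library actually uses.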
The downside to this is that PDFlib's licensing is fairly steep. I don't know if any of the free PHP PDF libraries support the latest PDF encryption schemes, especially the master password stuff, but if your budget can support it, PDFlib's the way to go for secure document production.