
How to get the diff of two PDF files using Python?

Tags: python, pdf

I need to find the difference between two PDF files. Does anybody know of any Python-related tool which has a feature that directly gives the diff of the two PDFs?

asked Aug 21 '09 09:08 by Goutham




1 Answer

What do you mean by "difference"? A difference in the text of the PDF, or a layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get, because PDF is a very complicated file format that offers endless formatting capabilities.

If you want the text diff, run a PDF-to-text utility on both PDFs and then use Python's built-in difflib module to diff the converted texts.
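The two steps above can be sketched roughly as follows. This is only a minimal sketch: it assumes the external `pdftotext` command-line tool (from Poppler) is installed and on the PATH, and the helper names `pdf_to_text` and `diff_texts` are made up for illustration. Any other PDF-to-text converter could be swapped in for the first step.

```python
import difflib
import subprocess


def pdf_to_text(path):
    """Extract text by calling the external `pdftotext` tool.

    Assumption: `pdftotext` (Poppler) is installed and on PATH.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],  # "-" sends the text to stdout
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def diff_texts(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )
```

With two files at hand, something like `print("\n".join(diff_texts(pdf_to_text("a.pdf"), pdf_to_text("b.pdf"))))` would print the differences; an empty result means the extracted texts are identical.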

This question deals with PDF-to-text conversion in Python: Python module for converting PDF to text.

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the same Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document in a PDF, and each converter uses a different approach. Often the PDF-to-text converter can't figure out the correct text flow, especially with complex layouts or tables.
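One possible way to reduce such spurious differences (a suggestion added here, not something the method above guarantees) is to compare the extracted texts word by word, so that line breaks and spacing introduced by different generators are ignored. The sketch below uses only the standard library; the function name `word_diff` is hypothetical. It will not help when the converter extracts the words in a different order.

```python
import difflib


def word_diff(old_text, new_text):
    """Return only the words removed ('- ') or added ('+ ').

    str.split() with no argument splits on any whitespace run, so
    differences in line breaks and spacing do not show up in the result.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    return [
        line
        for line in difflib.ndiff(old_words, new_words)
        if line.startswith(("- ", "+ "))
    ]
```

Two texts that differ only in whitespace produce an empty list, while a changed word shows up as one removed and one added entry.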

answered Oct 04 '22 20:10 by fbuchinger