I need to find the difference between two PDF files. Does anybody know of a Python tool that directly produces the diff of two PDFs?

How to get the diff of two PDF files using Python?

1 Answer

What do you mean by "difference"? A difference in the text of the PDF, or a layout change (e.g. an embedded graphic was resized)? The first is easy to detect; the second is almost impossible to get, because PDF is a very complicated file format that offers endless formatting capabilities.

If you want the text diff, run a PDF-to-text utility on both PDFs and then use Python's built-in diff library (difflib) to compare the converted texts.

This question deals with PDF-to-text conversion in Python: Python module for converting PDF to text.
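The two steps above (convert, then diff) could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: it assumes the external `pdftotext` command-line tool (from Poppler/Xpdf) is installed and on the PATH, and the function names `pdf_to_text`, `diff_texts` and `diff_pdfs` are made up for this example. Any other PDF-to-text converter could be swapped in.

```python
import difflib
import subprocess


def pdf_to_text(path):
    """Extract text with the external `pdftotext` tool (assumed to be installed)."""
    # "-" tells pdftotext to write the text to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def diff_texts(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def diff_pdfs(old_path, new_path):
    """Convert both PDFs to text, then diff the results."""
    return diff_texts(
        pdf_to_text(old_path), pdf_to_text(new_path), old_path, new_path
    )
```

With two files at hand, `print("\n".join(diff_pdfs("a.pdf", "b.pdf")))` would print the diff; an empty result means the extracted texts are identical, which says nothing about layout.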

The reliability of this method depends on the PDF generators you are using. If you use, e.g., Adobe Acrobat and some Ghostscript-based PDF creator to make two PDFs from the same Word document, you might still get a diff even though the source document was identical.

This is because there are dozens of ways to encode the information of the source document into a PDF, and each converter uses a different approach. In addition, the PDF-to-text converter often can't figure out the correct text flow, especially with complex layouts or tables.
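One possible way to reduce this kind of noise (an assumption on my part, not something the approach above requires) is to ignore line wrapping and compare the extracted texts word by word. The sketch below uses only difflib; `normalize` and `word_changes` are hypothetical helper names.

```python
import difflib


def normalize(text):
    """Split on any run of whitespace so line wrapping no longer matters."""
    return text.split()


def word_changes(old_text, new_text):
    """Return (tag, old_words, new_words) for every region that differs."""
    old_words = normalize(old_text)
    new_words = normalize(new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

This only hides differences in whitespace and line breaks. If the converter extracts the text in a different order (columns, tables), the word sequences still differ and will show up as changes.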

answered Oct 04 '22 20:10

fbuchinger


Tags:

python

pdf

Goutham
