
How to compare two PDFs based on visual differences programmatically? [closed]

I need to compare two PDF files and find all the visual differences between them. I know there are related questions on Stack Overflow, but they don't meet my needs.

I'm currently using PDFBox to render each page of the PDFs to an image and comparing the bytes of the images.

With this approach I can tell that a particular page differs.

But I need finer details, such as the font size of some text. For example: "The text" differs on page 6 of the PDFs.

This applies not only to text: I need to catch all visual differences, such as images, text in charts, etc.

Please suggest a way to achieve this.

PS: I tried Apache Tika, but my impression is that it is meant for extracting structured text as XHTML, plus metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I've got this wrong.

unknown_boundaries asked Mar 29 '26 22:03



1 Answer

PDF to image using Java

Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)

https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw

A good library for converting PDF to TIFF?

Convert jpeg/png to an array of pixels in java

int pixels array to bmp in java

Finding pixel position

Get Pixel Color around an image
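Once each page of both PDFs has been rendered to an image (for example with PDFBox, as the question already does), a pixel-by-pixel comparison can report where on the page the difference is, not just that the page differs. The following is a minimal sketch under that assumption. The class and method names (`PageImageDiff`, `diffBounds`) are made up for illustration and are not part of PDFBox. It uses only `java.awt`, so the rendering step itself is left out.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Hypothetical helper: locates the region where two rendered pages differ. */
class PageImageDiff {

    /**
     * Returns the bounding box of all differing pixels, or null if the
     * images are identical. Images of different sizes are reported as
     * differing over the whole larger area.
     */
    static Rectangle diffBounds(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return new Rectangle(0, 0,
                    Math.max(a.getWidth(), b.getWidth()),
                    Math.max(a.getHeight(), b.getHeight()));
        }
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE;
        int maxX = -1, maxY = -1;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                // getRGB returns the packed ARGB value of one pixel
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    minX = Math.min(minX, x);
                    minY = Math.min(minY, y);
                    maxX = Math.max(maxX, x);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        if (maxX < 0) {
            return null; // no differing pixel found
        }
        return new Rectangle(minX, minY, maxX - minX + 1, maxY - minY + 1);
    }

    public static void main(String[] args) {
        // Two blank "pages"; the second gets two changed pixels.
        BufferedImage a = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        b.setRGB(10, 20, 0xFFFFFF);
        b.setRGB(30, 40, 0xFFFFFF);
        System.out.println(diffBounds(a, b));
    }
}
```

A single bounding box is the simplest possible report. With several unrelated changes on one page it grows to cover all of them, so splitting the page into tiles and checking each tile separately may give more useful results.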

For extraction of text using PDFBox: Extracting text from PDF file using pdfbox

PDFBox has classes for detecting the position, type, and size of fonts, and possibly other settings (I didn't look deeper); see the links below. You could extract the text from both PDFs and compare it. Where the text is equal, compare its formatting. If anything differs, mark it for display in another text file, image, or PDF.

http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html

http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
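The text-then-format comparison described above could be sketched roughly as follows. This is an illustration only: `TextFormatDiff` and `TextRun` are hypothetical names, not PDFBox classes, and the sketch assumes the runs have already been extracted (for instance from PDFBox `TextPosition` objects, which expose the font via `getFont()` and the size via `getFontSizeInPt()`). It also assumes both documents yield their runs in the same order, which will not hold if text was inserted or removed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: compares text runs from two PDFs by content, then by format. */
class TextFormatDiff {

    /** Hypothetical value type holding one piece of text and its formatting. */
    static final class TextRun {
        final int page;
        final String text;
        final String fontName;
        final float fontSize;

        TextRun(int page, String text, String fontName, float fontSize) {
            this.page = page;
            this.text = text;
            this.fontName = fontName;
            this.fontSize = fontSize;
        }
    }

    /** Pairs runs by index (an assumption) and describes each difference found. */
    static List<String> compare(List<TextRun> left, List<TextRun> right) {
        List<String> diffs = new ArrayList<>();
        int n = Math.min(left.size(), right.size());
        for (int i = 0; i < n; i++) {
            TextRun l = left.get(i);
            TextRun r = right.get(i);
            if (!l.text.equals(r.text)) {
                diffs.add("page " + l.page + ": text differs: \""
                        + l.text + "\" vs \"" + r.text + "\"");
                continue; // format is only compared when the text is equal
            }
            if (!l.fontName.equals(r.fontName)) {
                diffs.add("page " + l.page + ": font of \"" + l.text
                        + "\" differs: " + l.fontName + " vs " + r.fontName);
            }
            if (Math.abs(l.fontSize - r.fontSize) > 0.01f) {
                diffs.add("page " + l.page + ": font size of \"" + l.text
                        + "\" differs: " + l.fontSize + " vs " + r.fontSize);
            }
        }
        if (left.size() != right.size()) {
            diffs.add("number of text runs differs: "
                    + left.size() + " vs " + right.size());
        }
        return diffs;
    }

    public static void main(String[] args) {
        List<TextRun> first = Arrays.asList(new TextRun(6, "The text", "Helvetica", 12f));
        List<TextRun> second = Arrays.asList(new TextRun(6, "The text", "Helvetica", 14f));
        compare(first, second).forEach(System.out::println);
    }
}
```

Pairing runs by index keeps the sketch short. For documents where text has been added or deleted, aligning the two run lists with a proper diff algorithm before comparing formats would likely be needed.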

Rasshu answered Apr 01 '26 10:04



