Convert image to searchable pdf [closed]

Tags: java, pdf, ocr, tiff

Hi, I am looking for an open-source Java API that can convert a TIFF image to a searchable PDF (OCR). I have researched this but found nothing so far.

NOTE: I have looked at the post "Java OCR implementation", but that API does not convert the image to PDF. However, I am still playing with the code a bit.

Thang Pham asked Feb 01 '12 20:02

2 Answers

You can convert images to PDF using iText. The hard thing here is doing the OCR, not creating the PDF.

I will warn you: any OCR engine that is worth using is going to cost you a significant amount of money. Free and open-source ones are generally pet projects or proofs of concept for some algorithm or another, and are not suitable for real-world OCR applications. Tesseract is probably the best of the bunch, but even its accuracy is far, far worse than that of the commercial engines.

We have a commercial OCR application, and I've been down this path while evaluating engines. I'd suggest that you bite the bullet, reach out to the engine providers and get quotes: Abbyy (best accuracy, most expensive, slower), Expervision (fast, not as accurate, middle-of-the-road price), Nuance (middle-of-the-road speed, accuracy and price). None of these is written in Java, so you should plan some time to develop JNI code around their APIs.

Good luck - it's a big project!

Kevin Day answered Sep 30 '22 09:09

Cuneiform is free and easy to use. It can output hOCR, which the hocr2pdf tool (part of ExactImage) can then use to generate an invisible text layer on a PDF.
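If you would rather build the text layer yourself from Java, the core of the job is reading the bounding boxes out of the hOCR and mapping them into PDF coordinates. Here is a rough sketch of that step using only the standard library. It assumes the hOCR carries `title="bbox x0 y0 x1 y1"` attributes in image pixels with the origin at the top-left, and that you know the scan resolution; the class name `HocrBoxes`, the `WordBox` type and the regex-based parsing are my own illustration, not part of Cuneiform or ExactImage. Real hOCR varies between engines, so a proper HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HocrBoxes {

    /** A piece of recognized text with its position in PDF points (origin at bottom-left). */
    public static final class WordBox {
        public final String text;
        public final double x, y, width, height;

        WordBox(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Any element with title="bbox x0 y0 x1 y1 ..." that is directly followed by text.
    // Assumption: attribute values contain no '>' and the text contains no nested tags.
    private static final Pattern BOX = Pattern.compile(
        "title=[\"']bbox (\\d+) (\\d+) (\\d+) (\\d+)[^\"']*[\"'][^>]*>([^<]+)<");

    /**
     * Extracts text boxes from hOCR markup and converts them from image pixels
     * (origin top-left) to PDF points (origin bottom-left, 72 points per inch).
     */
    public static List<WordBox> parse(String hocr, int imageHeightPx, int dpi) {
        List<WordBox> boxes = new ArrayList<>();
        Matcher m = BOX.matcher(hocr);
        while (m.find()) {
            String text = m.group(5).trim();
            if (text.isEmpty()) {
                continue; // container element (e.g. a line) holding only whitespace
            }
            int x0 = Integer.parseInt(m.group(1));
            int y0 = Integer.parseInt(m.group(2));
            int x1 = Integer.parseInt(m.group(3));
            int y1 = Integer.parseInt(m.group(4));
            double x = x0 * 72.0 / dpi;
            double y = (imageHeightPx - y1) * 72.0 / dpi; // flip the vertical axis
            double width = (x1 - x0) * 72.0 / dpi;
            double height = (y1 - y0) * 72.0 / dpi;
            boxes.add(new WordBox(text, x, y, width, height));
        }
        return boxes;
    }
}
```

Each resulting box tells you where to draw the word on the page with whatever PDF library you pick; drawing it in an invisible rendering mode on top of the page image is what makes the scan searchable.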

Alasdair answered Sep 30 '22 11:09