
How to embed external OCR into existing PDF?

Tags: xml, pdf, ocr

I have a set of images over which I run an OCR application. This process produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there an easy and free way to do this?

Some details:

  • I don't want to use Acrobat's OCR functionality;

  • The OCR process results in an XML file which contains elements like:

    <line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
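To make the data concrete, here is a minimal Python sketch of how such `<line>` elements could be read and mapped to PDF coordinates. It rests on several assumptions that the OCR tool's documentation would need to confirm: that `l`, `t`, `r`, `b` and `baseline` are pixel values measured from the top-left corner of the page image, that the scan resolution is 300 DPI, and that the elements sit inside some wrapper element (the `<page>` tag and the page height of 3300 pixels below are hypothetical). It only computes positions; writing the invisible text into the PDF would still require a PDF library.

```python
import xml.etree.ElementTree as ET

# Assumption: scan resolution of the page images (not stated in the question).
ASSUMED_DPI = 300


def pixels_to_points(value, dpi=ASSUMED_DPI):
    """Convert a pixel measurement to PDF points (1 point = 1/72 inch)."""
    return value * 72.0 / dpi


def parse_lines(xml_text, page_height_px, dpi=ASSUMED_DPI):
    """Return text and PDF-space placement for every <line> element.

    Assumes image coordinates with the origin at the top-left; PDF user
    space has its origin at the bottom-left, so the y axis is flipped.
    """
    root = ET.fromstring(xml_text)
    results = []
    for line in root.iter("line"):
        left = int(line.get("l"))
        top = int(line.get("t"))
        right = int(line.get("r"))
        bottom = int(line.get("b"))
        baseline = int(line.get("baseline"))
        results.append({
            "text": line.text or "",
            # Left edge of the text run, in points.
            "x": pixels_to_points(left, dpi),
            # Baseline height above the bottom of the page, in points.
            "y": pixels_to_points(page_height_px - baseline, dpi),
            "width": pixels_to_points(right - left, dpi),
            "height": pixels_to_points(bottom - top, dpi),
        })
    return results


# Hypothetical wrapper element around the sample line from the question.
sample = (
    '<page><line baseline="1049" l="158" t="1012" r="1196" b="1060">'
    'This is a sample line of text from an image</line></page>'
)
lines = parse_lines(sample, page_height_px=3300)
for entry in lines:
    print(entry)
```

The resulting `x`, `y` and `width` values are what a PDF library would need in order to place each line as invisible text over the corresponding part of the page image.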

Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file that was generated from a set of images and that already contains OCRed text. Would it be possible to (maybe programmatically) access just the image of each page, process it (e.g., convert it to monochrome), and save it back to the PDF file? If so, the OCRed text would not be lost.

[Should I put this update into a separate question?]

kepler asked Sep 28 '09 21:09

1 Answer

For your follow-up question about processing PDF files without losing the hidden text layer: I believe Ghostscript is able to do this. For example, the following command should convert a PDF to grayscale:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
Jukka Matilainen answered Oct 21 '22 08:10