How to embed external OCR into existing PDF?

Question

I have a set of images that I run through an OCR application. This produces an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there an easy, free way to do this?

Some details:

I don't want to use Acrobat's OCR functionality.
The OCR process produces an XML file that contains elements like:

<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
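
To make the coordinate mapping concrete, here is a minimal Python sketch, not a complete solution, that turns one such <line> element into a PDF content-stream fragment drawn in text rendering mode 3 (invisible text). It assumes that the l, t, b, and baseline attributes are pixels measured from the top-left corner of the page image, that the scan resolution is known (300 dpi here), and that the page resources define a font under the hypothetical name F1. The function names and the page height of 3300 pixels are illustrative only.

```python
import xml.etree.ElementTree as ET


def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def line_to_invisible_text(xml_line, page_height_px, dpi=300, font="F1"):
    """Turn one OCR <line> element into a PDF content-stream fragment.

    Assumption: coordinates are pixels from the top-left of the image.
    PDF user space starts at the bottom-left and uses points (1/72 inch),
    so the y axis is flipped and everything is scaled by 72/dpi.
    """
    elem = ET.fromstring(xml_line)
    scale = 72.0 / dpi
    x = int(elem.get("l")) * scale
    y = (page_height_px - int(elem.get("baseline"))) * scale
    # Use the height of the bounding box as a rough font size.
    font_size = (int(elem.get("b")) - int(elem.get("t"))) * scale
    text = escape_pdf_string(elem.text or "")
    # "3 Tr" selects text rendering mode 3: neither fill nor stroke.
    return "BT /%s %.2f Tf 3 Tr %.2f %.2f Td (%s) Tj ET" % (
        font, font_size, x, y, text)


sample = ('<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
          'This is a sample line of text from an image</line>')
# 3300 px is a hypothetical page height (11 inches at 300 dpi).
print(line_to_invisible_text(sample, page_height_px=3300))
# BT /F1 11.52 Tf 3 Tr 37.92 540.24 Td (This is a sample line of text from an image) Tj ET
```

The fragment still has to be appended to the page's content stream with a PDF library of your choice, and it does not stretch the text to match the r coordinate, so the selection highlight may not line up exactly with the image. Non-ASCII text would also need an encoding that matches the chosen font.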

Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file that was generated from a set of images and that already contains OCRed text. Would it be possible (maybe programmatically) to access just the image of each page, process it (e.g., convert it to monochrome), and save it back to the PDF file? If so, the OCRed text would not be lost.

[Should I put this update into a separate question?]

Jukka Matilainen · Accepted Answer

For your follow-up question about processing PDF files without losing the hidden layers: I believe Ghostscript is able to do this. For example, the following command should convert a PDF to grayscale:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
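
If the conversion has to be applied to many files, the command above could be wrapped in a small script. The following Python sketch only assembles the same Ghostscript arguments shown above and runs them. The function names are hypothetical, and it assumes the Ghostscript executable is installed and available on the PATH as gs.

```python
import shutil
import subprocess


def build_gs_gray_command(input_pdf, output_pdf, gs_binary="gs"):
    """Return the argument list for the grayscale conversion shown above."""
    return [
        gs_binary, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-dColorConversionStrategy=/Gray",
        "-dProcessColorModel=/DeviceGray",
        "-sOutputFile=" + output_pdf,
        input_pdf,
    ]


def convert_to_gray(input_pdf, output_pdf):
    """Run Ghostscript; raises if gs is missing or exits with an error."""
    cmd = build_gs_gray_command(input_pdf, output_pdf)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("Ghostscript executable 'gs' not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_gs_gray_command("input.pdf", "output.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. It is worth checking afterwards that the output file is still searchable, since the answer only states that Ghostscript should keep the hidden text.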

Tags: xml, pdf, ocr, kepler
