I have a set of images over which I run an OCR application. This process results in a XML file with character offsets. Then I convert the images to PDF using Acrobat 9. Now, I would like to add the XML file information as an invisible text layer into the PDF in order to achieve a searchable PDF. Is there an easy and free way?
Some details:
I don't want to use Acrobat's OCR functionality;
The OCR process results in a XML file which contains elements like:
<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
Update: it may be possible doing what I want in a different way. Supposing there is already a PDF file generated from a set of images, and which already contains OCRed text. Would it be possible to (maybe programmatically) access just the image of each page, process it (e.g., converting it to monochrome), and save it back to the PDF file? If yes, then the OCRed text would not be lost.
[Should I put this update into a separate question?]
For your follow-up question about processing PDF files without losing the the hidden layers: I believe Ghostscript is able to do this. For example, the following command should convert a PDF to grayscale:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With