Add a duplicate (hidden) text layer to a PDF for extra searching

Tags: search, pdf

My problem:

I have a PDF containing many Roman characters with complex diacritical marks (e.g., ṣ, ś, ṝ, ǎ). To make the PDF easier to search, I would like to add an additional text layer, much as one does with hOCR, in which the same text appears without the diacritics.

When using full-text search engines I can index multiple terms at the same position (vector); I would like to achieve the same effect here.

I have read a lot about adding an hOCR layer to scanned images, but I really just want to duplicate the existing text layer, pass it through a script that strips the diacritics (straightforward enough), and then add it back in as a hidden but searchable layer.
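For reference, the diacritic-stripping step could look something like the sketch below. It uses only Python's standard `unicodedata` module; the function name `strip_diacritics` is just an illustrative choice.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return `text` with combining marks removed (e.g. ṣ -> s, ǎ -> a)."""
    # NFD splits each precomposed character into a base letter followed by
    # its combining marks, so the marks can be filtered out on their own.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is left so the result is in NFC form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("ṣ ś ṝ ǎ"))  # prints: s s r a
```

This only handles marks that Unicode can decompose; letters such as ø or ł have no decomposition and would need an explicit mapping table.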

Anyone have any suggestions? (Solutions involving any platform, language, library or toolchain will be useful!)

Thanks :)

Edit: please let me know if the question is unclear.

asked Feb 26 '23 23:02 by simon


2 Answers

Well, I have a (slightly ugly and hackish) solution, so I thought I'd share it.

I'm using PDFMiner to extract the text along with its coordinates. I then use ReportLab to write the normalized version of the text to a new PDF, in exactly the same positions, as hidden text. To make the positions line up properly, I found I had to use exactly the same font, so I used a combination of FontForge and MuPDF to extract the required font(s) from the original PDF.
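A rough sketch of the extract-and-rewrite step, not my exact script, might look like this. It assumes the pdfminer.six and ReportLab packages. The names `Line`, `extract_pages_with_lines`, `normalize_pages` and `write_hidden_overlay` are made up for illustration, `normalize` stands for whatever diacritic-stripping callable you use, and Helvetica is only a stand-in for the font extracted from the original PDF. Text render mode 3 (invisible) is one way to get hidden text, and the font size is merely approximated from the line height.

```python
from typing import Callable, List, NamedTuple, Tuple


class Line(NamedTuple):
    text: str
    x: float       # left edge of the line's bounding box, in points
    y: float       # bottom edge of the line's bounding box, in points
    height: float  # bounding-box height, used as a rough font size


Page = Tuple[float, float, List[Line]]  # (page width, page height, lines)


def extract_pages_with_lines(pdf_path: str) -> List[Page]:
    """Collect every text line and its position using pdfminer.six."""
    # Imported here so the pure helper below works without the PDF libraries.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    pages = []
    for layout in extract_pages(pdf_path):
        lines = []
        for element in layout:
            if not isinstance(element, LTTextContainer):
                continue
            for item in element:
                if isinstance(item, LTTextLine):
                    x0, y0, _x1, y1 = item.bbox
                    lines.append(Line(item.get_text().rstrip("\n"), x0, y0, y1 - y0))
        pages.append((layout.width, layout.height, lines))
    return pages


def normalize_pages(pages: List[Page], normalize: Callable[[str], str]) -> List[Page]:
    """Return a copy of `pages` with each line's text passed through `normalize`."""
    return [
        (width, height, [ln._replace(text=normalize(ln.text)) for ln in lines])
        for width, height, lines in pages
    ]


def write_hidden_overlay(pages: List[Page], out_path: str,
                         font_name: str = "Helvetica") -> None:
    """Write each line at its original position as invisible text (ReportLab)."""
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(out_path)
    for width, height, lines in pages:
        c.setPageSize((width, height))
        for ln in lines:
            text_obj = c.beginText(ln.x, ln.y)
            text_obj.setFont(font_name, max(ln.height, 1.0))
            text_obj.setTextRenderMode(3)  # 3 = neither fill nor stroke
            text_obj.textOut(ln.text)
            c.drawText(text_obj)
        c.showPage()  # one overlay page per source page
    c.save()
```

Typical use would be `write_hidden_overlay(normalize_pages(extract_pages_with_lines("input.pdf"), normalize), "overlay.pdf")`, with the file names as placeholders. With a stand-in font the hidden text will not line up exactly with the visible glyphs, which is why I went to the trouble of extracting the original fonts.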

Finally, having created the new PDF, I'm using pdftk to merge it with the original.
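One way to do the merge from a script is sketched below. It assumes pdftk is on the PATH and uses its `multistamp` operation, which lays page N of the overlay over page N of the original. That operation is an assumption for this sketch; the helper names and file names are illustrative as well.

```python
import subprocess
from typing import List


def pdftk_stamp_command(original: str, overlay: str, output: str) -> List[str]:
    """Build the pdftk command that stamps `overlay` onto `original`, page by page."""
    return ["pdftk", original, "multistamp", overlay, "output", output]


def merge_overlay(original: str, overlay: str, output: str) -> None:
    """Run pdftk and raise CalledProcessError if it exits with an error."""
    subprocess.run(pdftk_stamp_command(original, overlay, output), check=True)
```

Calling `merge_overlay("original.pdf", "overlay.pdf", "searchable.pdf")` would then write the combined file.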

It works pretty well, but has the downside that copying text out of the PDF also copies the normalized text. This is acceptable for my present purposes, and I can't see any way around it. The PDF specification doesn't really support my objective, so I don't imagine I can do better than this hackish solution.

answered Mar 05 '23 16:03 by simon


I have written something similar in C#: it OCRs images, converts them to PDF, and adds searchable text. I used QuickPDF from www.quickpdf.com to create hidden white text objects on top of the image, and this worked reasonably well.

In your case, QuickPDF would allow you to extract the text strings along with their bounding boxes and font details. You could then normalize the text, create the invisible text objects using the existing font and position information, and save the result to a new file.

This would give you essentially the same result as your current approach: a PDF containing both the original and the normalized text.

QuickPDF is a commercial library. If your solution works well for you, then there is no use buying a commercial engine. The nice thing, though, is that it requires only a single SDK, so it is worth a look if you have more than a few PDFs to convert.

answered Mar 05 '23 16:03 by Andrew Cash