I have a PDF with a watermark in the background. When I scan the page to highlight a word that has watermark or annotation text behind it, the background text gets selected instead, because it is found first in the touch area.
I am using CGPDFScanner to scan the text.
My question is: how do I detect whether scanned text is background text or real text in the PDF? How do I differentiate between standard text and annotation text?
Thanks.
In general you have no chance to reliably differentiate between "background" and "real" text. Text is drawn somewhere on the page in some order. What counts as foreground, background, or normal text is a matter of human perception and may not be reflected at all in the structure of the PDF content stream.
You can try some educated guesswork, e.g. assuming that "real" text is in strong colors while background text is in lighter colors, or "real" text is arranged in horizontal lines while background text is often more diagonal, etc. But this is guesswork after all, nothing to rely on for sure.
On the other hand, with tagged PDFs you might have a chance, because the watermark may be marked as an artifact.
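In a content stream, artifacts are enclosed in a marked-content sequence that starts with BMC or BDC using the tag /Artifact and ends with EMC. Here is a minimal sketch of how you could track that while scanning, assuming you register callbacks for those three operators and read the tag name yourself (for example with CGPDFScannerPopName). The type MarkedContentStack and the mc_* functions are hypothetical names for illustration, not part of Core Graphics.

```c
#include <stdbool.h>
#include <string.h>

#define MC_MAX_DEPTH 32 /* assumed nesting limit for this sketch */

/* Hypothetical stack of open marked-content sequences. */
typedef struct {
    bool is_artifact[MC_MAX_DEPTH];
    int depth;
} MarkedContentStack;

/* Call from your BMC/BDC callback with the tag name (no leading slash). */
void mc_begin(MarkedContentStack *s, const char *tag) {
    if (s->depth < MC_MAX_DEPTH) {
        s->is_artifact[s->depth] = (strcmp(tag, "Artifact") == 0);
    }
    s->depth++; /* keep counting so BMC/EMC stay balanced past the limit */
}

/* Call from your EMC callback. */
void mc_end(MarkedContentStack *s) {
    if (s->depth > 0) {
        s->depth--;
    }
}

/* True if any enclosing marked-content sequence is an artifact. */
bool mc_inside_artifact(const MarkedContentStack *s) {
    int n = s->depth < MC_MAX_DEPTH ? s->depth : MC_MAX_DEPTH;
    for (int i = 0; i < n; i++) {
        if (s->is_artifact[i]) {
            return true;
        }
    }
    return false;
}
```

When a text-showing operator arrives, you would skip the string if mc_inside_artifact returns true. This only helps if the producer of the PDF actually marked the watermark as an artifact, and many do not.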
PS: I just saw you shared your file again. For your document the heuristics I mentioned would work: the background text is greyish and printed diagonally.
Thus, while scanning you have to keep track of the fill color and/or the transformation matrices. As soon as the scanner finds text, you know whether it is background or foreground based on the current color and/or matrix value.
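To make that concrete, here is a rough sketch of the classification step in plain C. It assumes you already maintain a small state record from your operator callbacks (fill color from g/rg, matrices from cm and Tm). TextState, Matrix, the function names, and both thresholds are my own assumptions for illustration and not part of Core Graphics, so tune them against your documents.

```c
#include <stdbool.h>

/* PDF-style affine matrix [a b c d tx ty]. */
typedef struct { double a, b, c, d, tx, ty; } Matrix;

/* Hypothetical snapshot of the state you track while scanning. */
typedef struct {
    double fill_r, fill_g, fill_b; /* current fill color, each 0..1 */
    Matrix ctm;                    /* current transformation matrix */
    Matrix text_matrix;            /* current text matrix */
} TextState;

/* Thresholds are guesses; adjust them per document. */
static const double PALE_LUMINANCE = 0.6;    /* at or above this counts as light */
static const double AXIS_SLOPE_TOL = 0.0875; /* about tan(5 degrees) */

static double abs_d(double x) { return x < 0 ? -x : x; }

/* Apply m first, then n (PDF row-vector convention). */
Matrix matrix_concat(Matrix m, Matrix n) {
    Matrix r;
    r.a  = m.a * n.a + m.b * n.c;
    r.b  = m.a * n.b + m.b * n.d;
    r.c  = m.c * n.a + m.d * n.c;
    r.d  = m.c * n.b + m.d * n.d;
    r.tx = m.tx * n.a + m.ty * n.c + n.tx;
    r.ty = m.tx * n.b + m.ty * n.d + n.ty;
    return r;
}

/* Light fill color, judged by a weighted luminance of the RGB channels. */
bool is_pale(const TextState *s) {
    double lum = 0.299 * s->fill_r + 0.587 * s->fill_g + 0.114 * s->fill_b;
    return lum >= PALE_LUMINANCE;
}

/* Baseline is neither horizontal nor vertical within the tolerance. */
bool is_diagonal(const TextState *s) {
    Matrix m = matrix_concat(s->text_matrix, s->ctm);
    double x = abs_d(m.a), y = abs_d(m.b); /* baseline direction is (a, b) */
    double lo = x < y ? x : y;
    double hi = x < y ? y : x;
    return hi > 0.0 && lo > AXIS_SLOPE_TOL * hi;
}

/* Strict variant: both hints must agree before text is treated as background. */
bool looks_like_watermark(const TextState *s) {
    return is_pale(s) && is_diagonal(s);
}
```

In the callback for a text-showing operator you would call looks_like_watermark with the current state and ignore the string if it returns true. Requiring both hints keeps false positives down; use either one alone if your watermarks are only pale or only rotated. For a grey fill set with g, store the same value in all three channels, and remember to save and restore the tracked state on q and Q.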
Be aware, though, it is not that easy with all documents.