Change text in PDF using Apple's PDFKit framework

Tags: xcode, macos, ios, pdf

I know that in Apple's PDFKit, a PDFPage has a 'string' property that returns an NSString object representing the text on the page: https://developer.apple.com/documentation/pdfkit/pdfpage?language=objc

Is there a way to change text that's in the PDF? If not, how do you recommend I go about figuring out how to edit text in a PDF? Thank you!

Jim Bak asked Jan 03 '23 22:01


1 Answer

To understand your real problem, you need to know more about how a PDF works. First, a PDF is more like a container of (drawing, rendering) instructions than a container of content.

There are two flavors of PDF: tagged and untagged. A tagged PDF is essentially a normal PDF document plus a tree-like data structure that tells you which parts of the document make up which logical elements.

Much like HTML markup, which describes a logical structure, the tags mark paragraphs, bullet points in lists, rows in tables, etc.

If you have an untagged document, you are essentially left with nothing but the bare rendering instructions:

go to position 50, 50
set font to Arial
set font color to 0, color space to grayscale
draw the glyph for 'H'
go to position 60, 50
draw the glyph for 'e'

Instructions like these are gathered into objects. Objects can be gathered into streams, and streams can be compressed. Instructions and objects do not need to appear in the file in any logical order.
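To illustrate why compression matters for text editing, here is a minimal Python sketch. It assumes the stream is Flate (zlib) compressed, which is common in PDFs but only one of several possible filters, and the content stream itself is a hand-written example rather than something taken from a real file.

```python
import zlib

# A hand-written content stream (hypothetical example) that draws "Hello".
content_stream = (
    b"BT\n"
    b"/F1 12 Tf\n"
    b"50 50 Td\n"
    b"(Hello) Tj\n"
    b"ET\n"
)

# What ends up in the file when the stream is Flate-compressed.
compressed = zlib.compress(content_stream)

# A naive byte search over the stored bytes does not find the text.
found_in_file = b"Hello" in compressed

# The text only becomes visible again after the stream is decompressed.
decompressed = zlib.decompress(compressed)
found_after_inflate = b"Hello" in decompressed

print(found_in_file, found_after_inflate)
```

So even locating the word you want to change usually means decoding streams first. In real files the text is often not plain ASCII inside the stream either, because fonts can use their own character encodings.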

Having objects means that you can reuse certain things, such as drawing the same image on every page for a company letterhead, or sharing a resource through an instruction like 'use the font in object 456'.

To make it possible to work with these objects, every object is given a number. A mapping from each object number to the object's byte offset in the file is stored near the end of the document. This is known as the cross-reference (xref) table.

xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
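As a rough illustration of how such a table is read, the following Python sketch parses the excerpt above. It assumes the classic plain-text xref layout (a subsection header with the first object number and an entry count, followed by entries made of a 10-digit byte offset, a 5-digit generation number, and 'n' for in use or 'f' for free). The function name parse_xref_section is made up for this example.

```python
def parse_xref_section(text):
    """Map object number -> (byte offset, generation, in_use) for one xref section."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the section to start with 'xref'")

    table = {}
    next_object = None
    for line in lines[1:]:
        parts = line.split()
        if len(parts) == 2:
            # Subsection header: first object number and declared entry count.
            next_object = int(parts[0])
        elif len(parts) == 3 and next_object is not None:
            # Entry: byte offset, generation number, 'n' (in use) or 'f' (free).
            offset, generation, kind = parts
            table[next_object] = (int(offset), int(generation), kind == "n")
            next_object += 1
        else:
            raise ValueError(f"unexpected xref line: {line!r}")
    return table


# The excerpt from the answer (it lists only the first few of the 42 declared entries).
excerpt = """\
xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
"""

table = parse_xref_section(excerpt)
print(table[152])  # (16, 0, True)
print(table[158])  # (2551, 0, True)
```

Reading it this way, object 152 starts at byte 16 of the file, object 153 at byte 1240, and so on.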

Now, back to your problem. Suppose that you replace the word 'dog' with the word 'cats'.

You'd run into several problems:

  • every byte offset after the edit is suddenly wrong, since 'cats' takes 4 bytes and 'dog' takes 3 bytes.
  • objects located after the edit can no longer be found at their recorded offsets, so the instructions that refer to them go wrong
  • if at any point your substitution causes the text to go too far out of alignment, you would need to perform layout again.
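The offset problem can be sketched in a few lines of Python. The "file" below is a toy byte string with three made-up objects, far simpler than a real PDF, and object_offsets is a helper invented for this example.

```python
def object_offsets(data, markers):
    """Return the byte offset at which each marker first appears in data."""
    return {marker: data.index(marker) for marker in markers}


# A toy "file" (hypothetical) containing three numbered objects.
original = (
    b"1 0 obj\n(The quick dog)\nendobj\n"
    b"2 0 obj\n(second object)\nendobj\n"
    b"3 0 obj\n(third object)\nendobj\n"
)
markers = [b"1 0 obj", b"2 0 obj", b"3 0 obj"]

# Offsets as an xref table written for the original file would record them.
recorded = object_offsets(original, markers)

# Replace the 3-byte word with a 4-byte word without touching anything else.
edited = original.replace(b"dog", b"cats")
actual = object_offsets(edited, markers)

for marker in markers:
    shift = actual[marker] - recorded[marker]
    print(marker.decode(), "recorded:", recorded[marker],
          "actual:", actual[marker], "shift:", shift)
```

The object that starts before the edit keeps its offset, while every object after it moves by one byte, so a reader that trusts the recorded offsets lands in the wrong place. A tool that edits text therefore has to rewrite the offsets as well (and, in a real PDF, the recorded stream lengths).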

Why is layout such a problem?

Remember what I said earlier about the PDF containing only the rendering instructions. It's insanely hard to reconstruct things like paragraph boundaries, tables, lists, etc. from the raw instructions.

This is especially so if you want to do it for scripts other than Latin (imagine Hebrew or Arabic), or if your page layout is non-standard (like a scientific article, which is set in columns rather than in lines that span the entire page).
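To give a feel for the difficulty, here is one naive heuristic sketched in Python. It is only an illustration, not how any particular library works. It assumes left-to-right text, glyphs on a line sharing exactly the same y coordinate, and a fixed horizontal step between glyphs. The function name and the max_step threshold are invented for the example.

```python
from collections import defaultdict


def group_into_lines(glyphs, max_step=15.0):
    """Group (x, y, char) glyphs into lines of text using a naive heuristic."""
    rows = defaultdict(list)
    for x, y, char in glyphs:
        rows[y].append((x, char))

    lines = []
    # PDF y coordinates grow upwards, so the top line has the largest y.
    for y in sorted(rows, reverse=True):
        row = sorted(rows[y])
        text = row[0][1]
        for (prev_x, _), (x, char) in zip(row, row[1:]):
            # Guess a word break whenever the horizontal jump is large.
            if x - prev_x > max_step:
                text += " "
            text += char
        lines.append(text)
    return lines


# Glyphs in arbitrary drawing order, as a PDF is allowed to store them.
single_column = [
    (60, 50, "e"), (90, 70, "o"), (50, 50, "H"),
    (50, 70, "O"), (80, 70, "n"), (60, 70, "h"),
]
print(group_into_lines(single_column))  # ['Oh no', 'He']

# Two columns at the same height get merged into one "line".
two_columns = [(50, 70, "a"), (60, 70, "b"), (300, 70, "c"), (310, 70, "d")]
print(group_into_lines(two_columns))  # ['ab cd']
```

The second call shows the weakness: text from two separate columns is read straight across, and the heuristic has no notion of paragraphs, tables, or right-to-left scripts at all.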

Structure recognition is in fact the topic of ongoing research.

Joris Schellekens answered Jan 05 '23 11:01