Change text in PDF using Apple's PDFKit framework

Question

I know that in Apple's PDFKit, PDFPage has a 'string' property that returns an NSString object representing the text on the page: https://developer.apple.com/documentation/pdfkit/pdfpage?language=objc

Is there a way to change text that's in the PDF? If not, how do you recommend I go about figuring out how to edit text in a PDF? Thank you!

Joris Schellekens · Accepted Answer

To understand your real problem, you need to know more about how a PDF works. First, a PDF is more like a container of (drawing, rendering) instructions than a container of content.

There are two flavors of PDF: tagged and untagged. A tagged PDF is essentially a normal PDF document plus a tree-like data structure that tells you which parts of the document make up which logical elements.

Much like HTML, which encodes a logical structure, the tags mark paragraphs, bullet points in lists, rows in tables, etc.

If you have an untagged document, you are essentially left with nothing but the bare rendering instructions:

go to position 50, 50
set font to Arial
set font color to 0, color space to grayscale
draw the glyph for 'H'
go to position 60, 50
draw the glyph for 'e'
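
In real PDF syntax, instructions like these are written as content-stream operators, for example Tf (select font), g (set gray level), Td (move the text position) and Tj (show a string). The following is a minimal sketch in Python that walks a hand-written content stream and reports which string is drawn where. The font resource name /F1 and the tokenizer are assumptions made for illustration: the tokenizer only handles simple literal strings, and only Td is tracked for positioning, so this is not a general content-stream parser.

```python
import re

# Hand-written content stream, roughly equivalent to the instructions above.
# /F1 is a hypothetical font resource name; a real file maps it to a font object.
CONTENT = b"""BT
/F1 12 Tf
0 g
50 50 Td
(H) Tj
10 0 Td
(e) Tj
ET"""

# Simplified tokens: literal strings "(...)", names "/Name", everything else.
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|[^\s()/]+")


def shown_text(stream):
    """Return (x, y, text) for every Tj operator, tracking Td moves only."""
    x = y = 0.0
    operands = []
    result = []
    for tok in TOKEN.findall(stream):
        if tok == b"BT":                      # begin text object: reset position
            x = y = 0.0
            operands = []
        elif tok == b"Td":                    # move relative to the current line start
            x += float(operands[-2])
            y += float(operands[-1])
            operands = []
        elif tok == b"Tj":                    # show the string operand
            result.append((x, y, operands[-1][1:-1].decode("latin-1")))
            operands = []
        elif tok in (b"Tf", b"g", b"ET"):     # operators this sketch ignores
            operands = []
        else:                                 # anything else is an operand
            operands.append(tok)
    return result


for x, y, text in shown_text(CONTENT):
    print(x, y, text)
```

Note that nothing in the stream says that 'H' and 'e' belong to the same word; that has to be inferred from the positions.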

Instructions like this are gathered into objects. Objects can be gathered into streams. Streams can be compressed. Instructions and objects do not need to appear in any logical order.
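
To illustrate what compression means for text editing, here is a small sketch that wraps some page content in a stream object using the /FlateDecode filter (zlib/deflate, the most common PDF stream filter). The object number 7, the font name /F1 and the helper names are made up for the example.

```python
import re
import zlib

# Hypothetical page content; /F1 is an assumed font resource name.
content = (
    b"BT /F1 12 Tf 50 700 Td (The dog barks.) Tj "
    b"0 -14 Td (The dog sleeps.) Tj "
    b"0 -14 Td (The dog eats.) Tj ET"
)


def make_stream_object(num, data):
    """Wrap data in a numbered stream object compressed with /FlateDecode."""
    body = zlib.compress(data)
    header = b"%d 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (num, len(body))
    return header + body + b"\nendstream\nendobj\n"


def read_stream(obj):
    """Use /Length to cut the compressed bytes out of the object and inflate them."""
    length = int(re.search(rb"/Length (\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(obj[start:start + length])


obj = make_stream_object(7, content)
print(b"dog" in content)            # the word is visible in the uncompressed content
print(b"dog" in obj)                # searching the bytes on disk will normally not find it
print(read_stream(obj) == content)  # the text only reappears after decompression
```

So a plain search-and-replace on the file's bytes usually cannot even locate the text, and any edit to the decompressed content changes the compressed size and therefore the /Length value.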

Having objects means that you can reuse certain things, such as an image that is drawn on every page of a company letterhead, or refer to them with instructions like 'use the font in object 456'.

To make these objects addressable, every object is given a number. A table mapping each object number to its byte offset in the file is stored at the end of the document. This is known as the cross-reference (xref) table.

xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
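
In the excerpt above, the line '152 42' says that the subsection starts at object number 152 and holds 42 entries (only the first seven are shown). Each entry is a 10-digit byte offset, a 5-digit generation number, and 'n' for an object in use or 'f' for a free one. A minimal sketch of reading such a table in Python, assuming the text has already been located in the file (the function name is made up for the example):

```python
XREF_EXCERPT = """xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
"""


def parse_xref(text):
    """Map object number -> (byte offset, generation) for in-use entries."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("not an xref section")
    table = {}
    next_obj = None
    for ln in lines[1:]:
        if ln == "trailer":                   # the trailer dictionary follows the table
            break
        parts = ln.split()
        if len(parts) == 2:                   # subsection header: first object, count
            next_obj = int(parts[0])
        elif len(parts) == 3 and next_obj is not None:
            offset, generation, kind = parts
            if kind == "n":                   # 'f' entries are free, skip them
                table[next_obj] = (int(offset), int(generation))
            next_obj += 1
        else:
            raise ValueError("unexpected xref line: " + ln)
    return table


for number, (offset, generation) in parse_xref(XREF_EXCERPT).items():
    print(number, offset, generation)
```

A reader uses this table to jump straight to an object, for example to byte 1240 for object 153, without scanning the whole file.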

Now, back to your problem. Suppose that you replace the word 'dog' with the word 'cats'.

You'd run into several problems:

every byte offset after the edit is suddenly wrong, since 'cats' takes up 4 bytes and 'dog' takes up 3 bytes.
objects can no longer be found at their recorded offsets, so the instructions that refer to them go wrong.
if at any point your substitution causes the text to go too far out of alignment, you would need to perform layout again.
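
The offset problem can be reproduced with a tiny hand-built file. The sketch below writes five objects plus an xref table, replaces '(dog)' with '(cats)' directly in the bytes, and then checks whether each recorded offset still points at the start of its object. The helper names and the object contents are assumptions for illustration, and the content stream is left uncompressed so the replacement can find the text at all.

```python
import re


def build_pdf(objects):
    """Serialize numbered objects and append an xref table with their offsets."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # object 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off      # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (len(objects) + 1, len(objects))
    out += b"startxref\n%d\n" % xref_pos
    out += b"%%EOF\n"
    return bytes(out)


def check_offsets(pdf):
    """Map object number -> True if its xref offset really points at 'N 0 obj'."""
    m = re.search(rb"(?m)^xref\n(\d+) (\d+)\n", pdf)
    first, count = int(m.group(1)), int(m.group(2))
    status = {}
    for i in range(count):
        entry = pdf[m.end() + 20 * i : m.end() + 20 * (i + 1)]
        offset, _generation, kind = entry.split()
        if kind == b"n":
            num = first + i
            status[num] = pdf.startswith(b"%d 0 obj" % num, int(offset))
    return status


def startxref_ok(pdf):
    """True if the startxref value still points at the xref keyword."""
    declared = int(re.search(rb"startxref\n(\d+)", pdf).group(1))
    return pdf.startswith(b"xref", declared)


text_ops = b"BT /F1 12 Tf 50 50 Td (dog) Tj ET"
objects = [
    b"<< /Length %d >>\nstream\n" % len(text_ops) + text_ops + b"\nendstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    b"<< /Type /Page /Parent 4 0 R /MediaBox [0 0 200 200] /Contents 1 0 R"
    b" /Resources << /Font << /F1 2 0 R >> >> >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Catalog /Pages 4 0 R >>",
]

original = build_pdf(objects)
edited = original.replace(b"(dog)", b"(cats)")   # naive in-place edit, one byte longer

print(check_offsets(original), startxref_ok(original))
print(check_offsets(edited), startxref_ok(edited))
```

Only object 1, which starts before the edited text, is still found at its recorded offset. Everything after it has moved by one byte, the startxref pointer is off as well, and the /Length of the content stream no longer matches. A real editor therefore has to rewrite the xref table (or append an incremental update) rather than patch bytes in place.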

Why is layout such a problem?

Remember what I said earlier: the PDF contains only the rendering instructions. It's extremely hard to reconstruct things like paragraph boundaries, tables, or lists from the raw instructions.

This is especially so if you want to support scripts other than Latin (imagine Hebrew or Arabic), or if your page layout is non-standard (like a scientific article, whose text is set in columns rather than in lines that span the full width of the page).
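
As a small illustration of why columns are a problem, here is a sketch of the most obvious way to rebuild lines from positioned text: group fragments by their y coordinate and sort each group by x. The fragments are invented sample data for a two-column page, and the function is a deliberately naive heuristic, not how any particular library does it.

```python
from collections import defaultdict

# Hypothetical text fragments as (x, y, text) from a two-column page.
fragments = [
    (50, 700, "Left column,"),
    (320, 700, "Right column,"),
    (50, 686, "second line."),
    (320, 686, "second line."),
]


def naive_reading_order(frags):
    """Group fragments by y, then read each group left to right."""
    lines = defaultdict(list)
    for x, y, text in frags:
        lines[y].append((x, text))
    ordered = []
    # The PDF y axis grows upward, so larger y values are nearer the top.
    for y in sorted(lines, reverse=True):
        ordered.append(" ".join(text for _, text in sorted(lines[y])))
    return ordered


for line in naive_reading_order(fragments):
    print(line)
```

The result interleaves the two columns ('Left column, Right column,' followed by 'second line. second line.'), because nothing in the coordinates says where one column ends and the next begins. Telling columns, tables and paragraphs apart needs additional heuristics on top of this.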

Structure recognition is in fact the topic of ongoing research.

Tags: xcode, macos, ios, pdf

Jim Bak
