 

How can I delete streams / objects from a PDF file?

Tags: pdf, ubuntu

I have noticed a problem with a couple of PDF files and mupdf. I cannot share the PDFs as they are, but I still want to help the mupdf developers understand the problem. I hope to delete or replace the content of a PDF so that I can share it.

peepdf gives me:

$ peepdf input.pdf
File: input.pdf
MD5: 243d9decc63d45866dcdcb18ca0ff686
SHA1: f025ee7fc151dc8241464bf78eab2f8b8692dba1
SHA256: c604a4eb5fe3b657543b1330fc98c5d3d64e8b4c16821dcba2c3123fbcb025da
Size: 212245 bytes
Version: 1.5
Binary: True
Linearized: False
Encrypted: False
Updates: 0
Objects: 101
Streams: 7
URIs: 0
Comments: 0
Errors: 1

Version 0:
        Catalog: 1
        Info: 2
        Objects (101): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101]
        Streams (7): [3, 10, 16, 44, 46, 100, 101]
                Encoded (6): [10, 16, 44, 46, 100, 101]
                Decoding errors (6): [10, 16, 44, 46, 100, 101]        

I would like to create a new PDF that is identical to the current one, except that, for example, stream 44 is no longer in it. The goal is a minimal PDF that still shows the error.

I've already removed all pages except for one page.

(The solution has to work on Ubuntu; preferably via Python)

I can't share the original PDF, but we can use this one as an example PDF file

Martin Thoma asked Oct 15 '25 14:10


2 Answers

Using pikepdf, you can delete the object. With the provided example file, the object is referenced in the PDF's /Catalog/StructTreeRoot.

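import pikepdf

# Open the PDF, drop the catalog's /StructTreeRoot entry, and save a copy.
with pikepdf.open("file.pdf") as p:
    del p.Root.StructTreeRoot  # removes the catalog's reference to the object
    p.save("file_without_structtreeroot.pdf")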

You cannot simply delete an object by its object number in pikepdf's object model, because other objects in the PDF may still reference it. Instead, you have to delete every reference to the object and then cull the unreferenced objects. If the object is referenced from more than one place, you will need to locate the other references as well.

(If you use pikepdf.Pdf.get_object((44, 0)), you obtain a new reference to object (44, 0). Deleting that reference only removes the reference you just created, not the object itself.)
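Locating the other references can be sketched with a plain scan of the raw bytes. This is only a rough illustration and not part of pikepdf: `find_referrers` is a hypothetical helper, and it assumes the objects are stored uncompressed in the file body (no object streams). Binary stream data can produce false hits, so treat the result as a list of candidates to inspect.

```python
import re

# Matches "N G obj ... endobj" in the raw file bytes (non-greedy body).
OBJ_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b", re.DOTALL)


def find_referrers(pdf_bytes, target_num, target_gen=0):
    """Return the numbers of objects whose bodies contain 'target_num target_gen R'."""
    ref_re = re.compile(rb"(?<!\d)%d\s+%d\s+R\b" % (target_num, target_gen))
    referrers = []
    for match in OBJ_RE.finditer(pdf_bytes):
        obj_num = int(match.group(1))
        if ref_re.search(match.group(3)):
            referrers.append(obj_num)
    return referrers
```

Calling `find_referrers(open("file.pdf", "rb").read(), 44)` would list the objects that point at object 44; each of those references has to be removed before the object becomes unreferenced.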

jbarlow answered Oct 18 '25 05:10


I needed to delete an object to remove confidential information from a PDF file: https://issues.apache.org/jira/browse/PDFBOX-5247

The first step is to identify which object you are interested in. It might be possible to pick it out from the structure of the file without decoding the streams. I decoded the PDF with:

java -jar pdfbox-app-2.y.z.jar WriteDecodedDoc <input-file> <output-file>

The decoded file can be examined, but the text in it is still encoded rather than readable. For example, the character codes in a text string only map to real characters through the font's cmap tables.

In my case it was easy to recognise the strings based on fonts, repeating space characters and line lengths. Text sits in blocks between BT and ET, and the line carrying the actual text ends in Tj (show text). I deleted object 8 by replacing its body with the stub below:

8 0 obj
   (Deleted)
endobj

At the end of a PDF file there is a cross-reference (xref) table that records the byte offset of each object in the file. Either update that table, or overwrite the characters to be removed so that every object stays at its original offset. I used a hex editor to overwrite the rest of the block in the file with NUL characters.
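The second option (overwrite in place so the offsets do not move) can be sketched in Python instead of a hex editor. This is a minimal sketch under stated assumptions: `blank_object_in_place` is a hypothetical helper, the object is stored uncompressed in the file body, and its data does not itself contain the bytes `endobj`. It pads with spaces rather than NUL characters; both count as whitespace in PDF syntax.

```python
import re


def blank_object_in_place(pdf_bytes, obj_num, gen=0, stub=b"(Deleted)"):
    """Replace one object's body with a stub padded to the original length."""
    pattern = re.compile(
        rb"(?<!\d)%d\s+%d\s+obj\b(.*?)\bendobj\b" % (obj_num, gen), re.DOTALL
    )
    match = pattern.search(pdf_bytes)
    if match is None:
        raise ValueError("object %d %d not found" % (obj_num, gen))
    start, end = match.span(1)
    # Two bytes are reserved for the newlines around the stub.
    padding = (end - start) - len(stub) - 2
    if padding < 0:
        raise ValueError("stub is longer than the original object body")
    new_body = b"\n" + stub + b" " * padding + b"\n"
    # Same length in, same length out: later objects keep their byte offsets.
    return pdf_bytes[:start] + new_body + pdf_bytes[end:]
```

Because the output has exactly the same length as the input, the table at the end of the file still points at the right offsets and does not need to be rewritten.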

Extracts from the file showing the text operator, the font and the cmap that are needed to decode the character codes to text:

...

BT
/PFLLBD+TimesNewRomanPSMT 11.00000 Tf
8.00 11.00 Td
0.50196 0.50196 0.50196 rg
0.26403 Tc
 ( * H Q H U D W H G  $ W      $ S U                $ 0)  Tj
 ET

...

25 0 obj
<<
/Type /Font
/Subtype /Type0
/BaseFont /PFLLBD+TimesNewRomanPSMT
/Name /PFLLBD+TimesNewRomanPSMT
/DescendantFonts [29 0 R]
/ToUnicode 30 0 R
/Encoding /Identity-H
>>
endobj

...

30 0 obj
<<
/Length 1317
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe)/Ordering (UCS)/Supplement 0>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0003><00b4>
endcodespacerange
52 beginbfrange
<0003><0003><00A0>
<000B><000B><0028>
<000C><000C><0029>
<000F><000F><002C>
<0010><0010><00AD>
<0011><0011><002E>
<0013><0013><0030>
<0014><0014><0031>
<0015><0015><0032>
<0017><0017><0034>
<001D><001D><003A>
<0024><0024><0041>
<0025><0025><0042>
<0026><0026><0043>
<0029><0029><0046>
<002A><002A><0047>
<002B><002B><0048>
<002C><002C><0049>
<002D><002D><004A>
<0030><0030><004D>
<0032><0032><004F>
<0033><0033><0050>
<0035><0035><0052>
<0036><0036><0053>
<0037><0037><0054>
<003A><003A><0057>
<003B><003B><0058>
<0044><0044><0061>
<0045><0045><0062>
<0046><0046><0063>
<0047><0047><0064>
<0048><0048><0065>
<0049><0049><0066>
<004A><004A><0067>
<004B><004B><0068>
<004C><004C><0069>
<004D><004D><006A>
<004E><004E><006B>
<004F><004F><006C>
<0050><0050><006D>
<0051><0051><006E>
<0052><0052><006F>
<0053><0053><0070>
<0054><0054><0071>
<0055><0055><0072>
<0056><0056><0073>
<0057><0057><0074>
<0058><0058><0075>
<0059><0059><0076>
<005A><005A><0077>
<005C><005C><0079>
<00B4><00B4><201D>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end

endstream
endobj


The Tj line above as hex bytes:

28 00 2a 00 48 00 51 00 48 00 55 00 44 00 57 00 
48 00 47 00 03 00 24 00 57 00 1d 00 03 00 13 00 
14 00 03 00 24 00 53 00 55 00 03 00 15 00 13 00 
15 00 14 00 03 00 13 00 14 00 1d 00 13 00 13 00 
1d 00 14 00 17 00 03 00 24 00 30 29 20 20 54 6a 
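The decoding step can be reproduced in a few lines of Python. The sketch below copies the hex dump and only the bfrange entries that this string needs from object 30; the helper names `parse_bfrange` and `decode_tj` are made up for illustration. It assumes two-byte big-endian codes (the font uses /Identity-H) and that the string contains no escaped parentheses or backslashes.

```python
import re

# Subset of the bfrange entries from the ToUnicode CMap (object 30).
BFRANGE_LINES = """
<0003><0003><00A0> <0013><0013><0030> <0014><0014><0031> <0015><0015><0032>
<0017><0017><0034> <001D><001D><003A> <0024><0024><0041> <002A><002A><0047>
<0030><0030><004D> <0044><0044><0061> <0047><0047><0064> <0048><0048><0065>
<0051><0051><006E> <0053><0053><0070> <0055><0055><0072> <0057><0057><0074>
"""

# The Tj line as hex, copied from the dump.
TJ_HEX = """
28 00 2a 00 48 00 51 00 48 00 55 00 44 00 57 00
48 00 47 00 03 00 24 00 57 00 1d 00 03 00 13 00
14 00 03 00 24 00 53 00 55 00 03 00 15 00 13 00
15 00 14 00 03 00 13 00 14 00 1d 00 13 00 13 00
1d 00 14 00 17 00 03 00 24 00 30 29 20 20 54 6a
"""


def parse_bfrange(text):
    """Build {character code: Unicode character} from <lo><hi><dst> triples."""
    mapping = {}
    triple = r"<([0-9A-Fa-f]+)><([0-9A-Fa-f]+)><([0-9A-Fa-f]+)>"
    for lo, hi, dst in re.findall(triple, text):
        for offset in range(int(hi, 16) - int(lo, 16) + 1):
            mapping[int(lo, 16) + offset] = chr(int(dst, 16) + offset)
    return mapping


def decode_tj(hex_dump, mapping):
    """Decode the two-byte codes between '(' and ')' of a hex-dumped Tj line."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    payload = raw[raw.index(b"(") + 1 : raw.rindex(b")")]
    codes = [int.from_bytes(payload[i : i + 2], "big") for i in range(0, len(payload), 2)]
    return "".join(mapping.get(code, "?") for code in codes)


text = decode_tj(TJ_HEX, parse_bfrange(BFRANGE_LINES))
print(text.replace("\u00a0", " "))  # code 0003 maps to a no-break space
```

With the data above this prints `Generated At: 01 Apr 2021 01:00:14 AM`, which shows that the string is readable by anyone who has the cmap, so it has to be overwritten rather than left in its encoded form.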
flywire answered Oct 18 '25 05:10


