I want to add a metadata key-value pair to the metadata of a pdf file.
I found an answer that is several years old, but I think it is way too complicated. I guess there is an easier way today: https://stackoverflow.com/a/3257340/633961
I am not married to PyPDF2; if there is an easier way, I will go that way.
Choose File > Properties, click the Description tab, and then click Additional Metadata. Select Advanced from the list on the left. To edit the metadata, do any of the following, and then click OK. To add previously saved information, click Append, select an XMP or FFO file, and click Open.
I was surprised to see there is no code sample for PyPDF2 when the question explicitly asks about PyPDF2, so here it is:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("source.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
metadata = reader.getDocumentInfo()
writer.addMetadata(metadata)
# Write your custom metadata here:
writer.addMetadata({"/Some": "Example"})
with open("result.pdf", "wb") as fp:
    writer.write(fp)
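The keys in the snippet above start with a slash because they are PDF names in the document info dictionary. As a minimal sketch, the two `addMetadata` calls could be folded into one by merging the dictionaries first. `build_info` is a hypothetical helper, not part of PyPDF2, and it assumes every value can sensibly be turned into a string.

```python
def build_info(existing, custom):
    """Merge existing document info with custom entries.

    Keys are coerced to the "/Name" form and values to str.
    Hypothetical helper, not part of PyPDF2.
    """
    merged = {}
    # Later sources win, so custom entries override existing ones.
    for source in (existing or {}, custom):
        for key, value in source.items():
            name = str(key)
            if not name.startswith("/"):
                name = "/" + name
            merged[name] = str(value)
    return merged


info = build_info({"/Author": "Someone"}, {"Some": "Example"})
print(info)  # {'/Author': 'Someone', '/Some': 'Example'}
```

The result could then be passed as `writer.addMetadata(info)`, with `reader.getDocumentInfo()` as the first argument to `build_info`.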
You can do that using pdfrw
pip install pdfrw
Then run
from pdfrw import PdfReader, PdfWriter
trailer = PdfReader("myfile.pdf")
trailer.Info.WhoAmI = "Tarun Lalwani"
PdfWriter("edited.pdf", trailer=trailer).write()
And then check the PDF Custom Properties
There are several ways to edit PDF metadata in Python, but one way is better than the others.
I will start with the approaches that look right but have side effects. If you are short on time, skip to the end and just use the correct way.
Using pdfrw. Weakness: the package is no longer maintained.
from pdfrw import PdfReader, PdfWriter, PdfDict
if __name__ == '__main__':
    pdf_reader = PdfReader('old.pdf')
    metadata = PdfDict(Author='Someone', Title='PDF in Python')
    pdf_reader.Info.update(metadata)
    PdfWriter().write('new.pdf', pdf_reader)
pdfrw can do this quite easily without losing non-display information such as bookmarks.
PyPDF2 supports more PDF features than pdfrw, including decryption and more types of decompression.
Using PyPDF2 with PdfFileWriter. Weakness: the output PDF does not preserve outlines (bookmarks).
import pprint
from PyPDF2 import PdfFileReader, PdfFileWriter
if __name__ == '__main__':
    file_in = open('old.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    metadata = pdf_reader.getDocumentInfo()
    pprint.pprint(metadata)
    pdf_writer = PdfFileWriter()
    pdf_writer.appendPagesFromReader(pdf_reader)
    pdf_writer.addMetadata({
        '/Author': 'Someone',
        '/Title': 'PDF in Python'
    })
    file_out = open('new.pdf', 'wb')
    pdf_writer.write(file_out)
    file_in.close()
    file_out.close()
Using PdfFileWriter, create a new PDF, pull in the old contents through appendPagesFromReader(), then call addMetadata().
It seems that we cannot directly modify the PDF metadata, so we add all pages and metadata then write out to a new file.
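The same copy-then-write flow can be sketched with the newer snake_case names. This assumes the `pypdf` package (the successor to PyPDF2) is installed; `copy_with_metadata` is a hypothetical helper name, and like the PdfFileWriter version above it copies pages only, so outlines are not carried over.

```python
def copy_with_metadata(src, dst, extra):
    """Copy pages and existing info from src to dst, then add `extra`.

    src and dst may be paths or binary file objects.
    """
    # Imported here so the sketch still loads where pypdf is absent.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    writer.append_pages_from_reader(reader)

    # Carry over the existing info entries as plain strings.
    old_info = reader.metadata
    if old_info:
        writer.add_metadata({key: str(old_info[key]) for key in old_info})

    # Custom entries use "/Name" keys, as in the examples above.
    writer.add_metadata(extra)
    writer.write(dst)
```

A call would look like `copy_with_metadata("old.pdf", "new.pdf", {"/Some": "Example"})`.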
The correct way to edit PDF metadata in Python.
import pprint
from PyPDF2 import PdfFileReader, PdfFileMerger
if __name__ == '__main__':
    file_in = open('old.pdf', 'rb')
    pdf_reader = PdfFileReader(file_in)
    metadata = pdf_reader.getDocumentInfo()
    pprint.pprint(metadata)
    pdf_merger = PdfFileMerger()
    pdf_merger.append(file_in)
    pdf_merger.addMetadata({
        '/Author': 'Someone',
        '/Title': 'PDF in Python'
    })
    file_out = open('new.pdf', 'wb')
    pdf_merger.write(file_out)
    file_in.close()
    file_out.close()
Using PdfFileMerger, concatenate pages through append(). Its import_bookmarks parameter defaults to True, which is why the outlines carry over:
append(fileobj, bookmark=None, pages=None, import_bookmarks=True)
pdfrw: the other Python PDF library
Reading and writing pdf metadata
Building on what Cyril N. stated, the code works fine, but it creates a lot of "trash" files since now you have the original file and the file with the metadata.
I changed the code a bit since I will run this on hundreds of files a day, and don't want to deal with the additional clean-up:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader("your_original.pdf")
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
metadata = reader.getDocumentInfo()
writer.addMetadata(metadata)
# Write your custom metadata here:
writer.addMetadata({"/Title": "this"})
with open("your_original.pdf", "ab") as fout:
    # ab is append binary; if you do wb, the file will append blank pages
    writer.write(fout)
If you do want a new file, just use a different name for the PDF in fout and keep ab. If you use wb, you will append as many blank pages as your original file has pages.
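Another way to avoid leftover copies, without relying on append mode, is to write to a temporary file and rename it over the original. This is a minimal sketch using only the standard library. `rewrite_in_place` is a hypothetical helper name, and the sketch assumes the platform allows replacing the target file (on Windows this can fail while the file is still open elsewhere).

```python
import os
import tempfile


def rewrite_in_place(path, write_to):
    """Write new content to a temp file, then swap it over `path`.

    `write_to` is any callable that takes a binary file object,
    for example the `write` method of a PDF writer.
    """
    # Create the temp file next to the target so the rename stays
    # on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            write_to(tmp)
        os.replace(tmp_path, path)  # rename over the original
    except BaseException:
        # Leave the original untouched and clean up the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With the writer from the snippet above, the call would be `rewrite_in_place("your_original.pdf", writer.write)`.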