Writing metadata to a pdf using pyobjc

I'm trying to write metadata to a pdf file using the following python code:

from Foundation import *
from Quartz import *

url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)
assert pdfdoc, "failed to create document"

print "reading pdf file"

attrs = {}
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"

PDFDocumentTitleAttribute = "test"

pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")   

print "pdf made"

This appears to work fine (no errors on the console); however, when I examine the metadata of the new file, it is as follows:

PdfID0: 242b7e252f1d3fdd89b35751b3f72d3
PdfID1: 242b7e252f1d3fdd89b35751b3f72d3
NumberOfPages: 4

and the original file had the following metadata:

InfoKey: Creator
InfoValue: PScript5.dll Version 5.2.2
InfoKey: Title
InfoValue: Microsoft Word - PROGRESS  ON  THE  GABION  HOUSE Compressed.doc
InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: Author
InfoValue: PWK
InfoKey: ModDate
InfoValue: D:20101021193627-05'00'
InfoKey: CreationDate
InfoValue: D:20101008152350Z
PdfID0: d5fd6d3960122ba72117db6c4d46cefa
PdfID1: 24bade63285c641b11a8248ada9f19
NumberOfPages: 4

So there are two problems: the new metadata is not being added, and the existing metadata is being cleared. What do I need to do to get this to work? My objective is to add metadata that reference management systems can import.

djq, asked Nov 04 '10 19:11


1 Answer

Mark is on the right track, but there are a few peculiarities that should be accounted for.

First, he is correct that pdfdoc.documentAttributes() returns an NSDictionary containing the document metadata. You want to modify it, but an NSDictionary is immutable, so you first have to copy it into an NSMutableDictionary as follows:

attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())

Now you can modify attrs as you did. There is no need to write PDFDocument.PDFDocumentTitleAttribute as Mark suggested; that won't work. PDFDocumentTitleAttribute is declared as a module-level constant, so just use it as you did in your own code.
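To illustrate why copying the existing attributes matters, here is a plain-Python analogy. It is only a sketch: FakeDocument is a hypothetical stand-in and not part of PDFKit or PyObjC, the string keys and the "Old title" value are illustrative, and it assumes that setting the document attributes replaces the whole dictionary rather than merging into it. It models that one behaviour and nothing else about PDFKit.

```python
# Plain-Python analogy; FakeDocument is a hypothetical stand-in, not PDFKit.
class FakeDocument(object):
    def __init__(self, attributes):
        self._attributes = dict(attributes)

    def documentAttributes(self):
        # Return a copy so callers cannot change the stored metadata in place.
        return dict(self._attributes)

    def setDocumentAttributes_(self, attributes):
        # Assumed behaviour: the setter replaces everything, nothing is merged.
        self._attributes = dict(attributes)


# Illustrative metadata; the keys are plain strings chosen for this sketch.
original = {"Title": "Old title", "Author": "PWK", "Producer": "GPL Ghostscript 8.15"}

# Approach from the question: start from an empty dictionary.
doc_a = FakeDocument(original)
doc_a.setDocumentAttributes_({"Title": "THIS IS THE TITLE"})

# Approach from this answer: copy the existing attributes, then modify the copy.
doc_b = FakeDocument(original)
attrs = doc_b.documentAttributes()
attrs["Title"] = "THIS IS THE TITLE"
doc_b.setDocumentAttributes_(attrs)

lost_keys = sorted(set(original) - set(doc_a.documentAttributes()))
kept_keys = sorted(doc_b.documentAttributes())
print(lost_keys)
print(kept_keys)
```

In the first case the Author and Producer entries disappear; in the second they survive alongside the new title.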

Here is the full code that works for me:

from Foundation import *
from Quartz import *

# Open the existing PDF
url = NSURL.fileURLWithPath_("test.pdf")
pdfdoc = PDFDocument.alloc().initWithURL_(url)

# Copy the existing (immutable) metadata into a mutable dictionary
attrs = NSMutableDictionary.alloc().initWithDictionary_(pdfdoc.documentAttributes())
attrs[PDFDocumentTitleAttribute] = "THIS IS THE TITLE"
attrs[PDFDocumentAuthorAttribute] = "A. Author and B. Author"

# Store the updated metadata and write the result to a new file
pdfdoc.setDocumentAttributes_(attrs)
pdfdoc.writeToFile_("mynewfile.pdf")
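
To check the result, you could parse the InfoKey/InfoValue dumps shown in the question and compare them before and after. This is a minimal sketch that assumes that dump format; parse_info_dump is a hypothetical helper, and the sample text is abbreviated from the question's output:

```python
def parse_info_dump(text):
    """Collect InfoKey/InfoValue pairs from a metadata dump into a dict."""
    info = {}
    key = None
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        name, value = name.strip(), value.strip()
        if name == "InfoKey":
            key = value
        elif name == "InfoValue" and key is not None:
            info[key] = value
            key = None
    return info


# Abbreviated from the dump of the original file in the question.
before = parse_info_dump("""InfoKey: Producer
InfoValue: GPL Ghostscript 8.15
InfoKey: Author
InfoValue: PWK
InfoKey: CreationDate
InfoValue: D:20101008152350Z
NumberOfPages: 4
""")

# Abbreviated from the dump of the rewritten file in the question.
after = parse_info_dump("""PdfID0: 242b7e252f1d3fdd89b35751b3f72d3
NumberOfPages: 4
""")

# Keys present in the original file but missing from the new one.
missing = sorted(set(before) - set(after))
print(missing)
```

An empty list for missing would mean that every original Info key is still present in the new file.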

Tamás, answered Sep 28 '22 05:09