Convert CreationTime of PDF to a readable format in Python

I'm working with PDF files in Python and reading the file's metadata with PDFMiner. I extract the info like this:

from pdfminer.pdfparser import PDFParser, PDFDocument    
fp = open('diveintopython.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument()
parser.set_document(doc)
doc.set_parser(parser)
doc.initialize()

print(doc.info[0]['CreationDate'])
# This prints the value "D:20130501200439+01'00'"

How can I convert D:20130501200439+01'00' into a readable format in Python?

kimbebot asked May 12 '13 00:05


1 Answer

I found the format documented at http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm. I needed to cope with time zones too, because I have 160k documents from all over to deal with. Here is my full solution:

import datetime
import re
from dateutil.tz import tzutc, tzoffset


pdf_date_pattern = re.compile(''.join([
    r"(D:)?",
    r"(?P<year>\d\d\d\d)",
    r"(?P<month>\d\d)",
    r"(?P<day>\d\d)",
    r"(?P<hour>\d\d)",
    r"(?P<minute>\d\d)",
    r"(?P<second>\d\d)",
    r"(?P<tz_offset>[-+zZ])?",  # '-' goes first so it is a literal, not a '+' to 'z' range
    r"(?P<tz_hour>\d\d)?",
    r"'?(?P<tz_minute>\d\d)?'?"]))


def transform_date(date_str):
    """
    Convert a pdf date such as "D:20120321183444+07'00'" into a usable datetime
    http://www.verypdf.com/pdfinfoeditor/pdf-date-format.htm
    (D:YYYYMMDDHHmmSSOHH'mm')
    :param date_str: pdf date string
    :return: datetime object
    """
    match = pdf_date_pattern.match(date_str)
    if match:
        date_info = match.groupdict()

        for k, v in list(date_info.items()):  # transform values (works on Python 2 and 3)
            if v is None:
                pass
            elif k == 'tz_offset':
                date_info[k] = v.lower()  # so we can treat Z as z
            else:
                date_info[k] = int(v)

        if date_info['tz_offset'] in ('z', None):  # UTC
            date_info['tzinfo'] = tzutc()
        else:
            multiplier = 1 if date_info['tz_offset'] == '+' else -1
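            # A tz_hour or tz_minute group that did not match is None; treat it as 0
            tz_seconds = 3600 * (date_info['tz_hour'] or 0) + 60 * (date_info['tz_minute'] or 0)
            date_info['tzinfo'] = tzoffset(None, multiplier * tz_seconds)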

        for k in ('tz_offset', 'tz_hour', 'tz_minute'):  # no longer needed
            del date_info[k]

        return datetime.datetime(**date_info)
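
If you would rather not depend on dateutil, the same idea can be sketched with only the standard library on Python 3. This is a minimal sketch, not a drop-in replacement: the names PDF_DATE_RE and pdf_date_to_datetime are made up for illustration, it assumes the full D:YYYYMMDDHHmmSS prefix is present, and like the code above it assumes a missing offset means UTC. The last lines show one way to get the "readable format" the question asks for.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical names; the pattern follows D:YYYYMMDDHHmmSSOHH'mm'
PDF_DATE_RE = re.compile(
    r"(?:D:)?"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})"
    r"(?P<sign>[-+zZ])?"
    r"(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?"
)


def pdf_date_to_datetime(date_str):
    """Return a timezone-aware datetime, or None if date_str does not match."""
    match = PDF_DATE_RE.match(date_str)
    if match is None:
        return None
    parts = match.groupdict()

    sign = parts["sign"]
    if sign in (None, "z", "Z"):
        # Assumption: no offset (or Z) is treated as UTC
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(parts["tz_hour"] or 0),
                           minutes=int(parts["tz_minute"] or 0))
        tz = timezone(offset if sign == "+" else -offset)

    fields = ("year", "month", "day", "hour", "minute", "second")
    return datetime(*(int(parts[name]) for name in fields), tzinfo=tz)


created = pdf_date_to_datetime("D:20130501200439+01'00'")
print(created.isoformat())                       # 2013-05-01T20:04:39+01:00
print(created.strftime("%Y-%m-%d %H:%M:%S %z"))  # 2013-05-01 20:04:39 +0100
```

Once you have a datetime object, strftime() lets you pick whatever display format you need, and astimezone(timezone.utc) normalizes everything to UTC if you are comparing documents from different time zones.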

Paul Whipp answered Nov 14 '22 21:11