
How to extract the title of a PDF document from within a script for renaming?

I have thousands of PDF files on my computer, named a0001.pdf through a3621.pdf. Each one contains a title, e.g. "aluminum carbonate" for a0001.pdf, "aluminum nitrate" for a0002.pdf, etc., which I'd like to extract and use to rename the files.

I use this program to rename a file:

import os

path=r"C:\Users\YANN\Desktop\..."

old='string 1'
new='string 2'

def rename(path,old,new):
    for f in os.listdir(path):
        os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))

rename(path,old,new)

Is there a way to extract the title embedded in each PDF file and use it to rename the file?

ParaH2 asked Jun 16 '17 22:06



2 Answers

Installing the package

Python's standard library has no PDF parser, so this cannot be solved with plain Python. You will need an external package such as pdfrw, which can read PDF metadata. It is easy to install with pip, the standard Python package manager.

On Windows, first make sure you have a recent version of pip using the shell command:

python -m pip install -U pip

On Linux:

pip install -U pip

On both platforms, then install the pdfrw package with:

pip install pdfrw

The code

I combined the approaches of zeebonk and user2125722 into something compact and readable that stays close to your original code:

import os
from pdfrw import PdfReader

path = r'C:\Users\YANN\Desktop'


def renameFileToPDFTitle(path, fileName):
    fullName = os.path.join(path, fileName)
    # Extract the pdf title from the pdf file; skip files without one
    info = PdfReader(fullName).Info
    title = info.Title if info else None
    if not title:
        return
    # Remove surrounding brackets that some pdf titles have
    newName = title.strip('()') + '.pdf'
    newFullName = os.path.join(path, newName)
    os.rename(fullName, newFullName)


for fileName in os.listdir(path):
    # Rename only pdf files
    fullName = os.path.join(path, fileName)
    if not os.path.isfile(fullName) or not fileName.lower().endswith('.pdf'):
        continue
    renameFileToPDFTitle(path, fileName)
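
One thing a rename loop like this still has to cope with: a title may contain characters that are not allowed in file names (for example / or : on Windows), and two PDFs may share the same title. Below is a minimal sketch of one way to handle both, using only the standard library. The helper names safe_filename and unique_name are hypothetical, and the set of replaced characters is an assumption based on the usual Windows restrictions, so adjust it to your file system:

```python
import re

# Characters Windows rejects in file names, plus control characters (assumed set)
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, fallback="untitled"):
    """Turn a PDF title into a string usable as a file name (without extension)."""
    cleaned = _INVALID.sub("_", title or "")
    # Collapse runs of whitespace, then drop leading/trailing spaces and dots
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned or fallback


def unique_name(base, taken, ext=".pdf"):
    """Return base+ext, or base_2+ext, base_3+ext, ... if that name is taken.

    `taken` is a set of lower-cased names already in use; it is updated in place.
    """
    candidate = base + ext
    counter = 2
    while candidate.lower() in taken:
        candidate = "{}_{}{}".format(base, counter, ext)
        counter += 1
    taken.add(candidate.lower())
    return candidate
```

To use it, start with taken = {name.lower() for name in os.listdir(path)} and build the new name as unique_name(safe_filename(title), taken) before calling os.rename. Names are compared in lower case because Windows file names are case-insensitive.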
Manu CJ answered Oct 14 '22 05:10



What you need is a library that can actually read PDF files, for example pdfrw:

In [8]: from pdfrw import PdfReader

In [9]: reader = PdfReader('example.pdf')

In [10]: reader.Info.Title
Out[10]: 'Example PDF document'
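
Not every PDF carries a title. As far as I know, pdfrw gives you None when the Info dictionary or its Title entry is missing, so it is worth guarding the lookup before using the result as a file name. A minimal sketch follows. The helper name read_title is hypothetical, and it only assumes an object with an Info attribute like the reader in the session above:

```python
def read_title(reader):
    """Return the document title as a string, or None if the PDF has none.

    `reader` is expected to look like a pdfrw PdfReader: an object with an
    `Info` attribute whose `Title` may be missing (assumed to yield None).
    """
    info = getattr(reader, "Info", None)
    title = getattr(info, "Title", None) if info is not None else None
    if title is None:
        return None
    # Some titles come back wrapped in parentheses; drop one surrounding pair
    text = str(title).strip()
    if text.startswith("(") and text.endswith(")"):
        text = text[1:-1]
    # Treat an empty or whitespace-only title as missing
    return text.strip() or None
```

With pdfrw installed, this would be called as read_title(PdfReader('example.pdf')); files for which it returns None can simply be left with their original names.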
zeebonk answered Oct 14 '22 05:10
