
Extracting titles from PDF files?

Tags: python, pdf

I want to write a script that automatically renames downloaded papers to their titles. Are there any libraries or tricks I can make use of? The PDFs are all generated by TeX, so they should have a fairly 'formal' structure.

ZelluX asked May 26 '09 16:05




2 Answers

You could try pyPdf, which can read the title stored in the PDF's document info.

For example:

from pyPdf import PdfFileReader

def get_pdf_title(pdf_file_path):
    # PDFs are binary files, so open in 'rb' mode
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        # Read the title while the file is still open
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')
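
Once a title has been obtained, the renaming step the question asks about could be sketched roughly as follows. This is only a minimal illustration: the helper names `title_to_filename` and `rename_to_title`, the set of replaced characters, and the 120-character limit are all assumptions rather than part of pyPdf. Bear in mind that the metadata title may be missing or empty, in which case the sketch leaves the file untouched.

```python
import os
import re

def title_to_filename(title, max_length=120):
    """Turn a paper title into a conservative file name (hypothetical helper)."""
    # Collapse runs of whitespace, including line breaks, into single spaces
    cleaned = " ".join(title.split())
    # Replace characters that are not allowed in file names on common systems
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", cleaned)
    # Truncate overly long titles and drop trailing spaces or dots
    cleaned = cleaned[:max_length].rstrip(" .")
    return cleaned + ".pdf"

def rename_to_title(pdf_path, title):
    """Rename pdf_path after its title; return the resulting path."""
    # No usable title: keep the original name
    if not title or not title.strip():
        return pdf_path
    target = os.path.join(os.path.dirname(pdf_path), title_to_filename(title))
    # Never overwrite an existing file
    if os.path.exists(target):
        return pdf_path
    os.rename(pdf_path, target)
    return target
```

Under these assumptions, `rename_to_title(path, get_pdf_title(path))` would rename a single file, and looping over a download directory is a small extension.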
schnaader answered Oct 18 '22 05:10


Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
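
The id search described above could be sketched like this, assuming the PDF's text has already been extracted into a string by some other tool. The function name `find_arxiv_id` and the regular expression are illustrative assumptions. The pattern covers new-style ids such as 1501.00730 and old-style ids such as hep-th/9901001, and it drops any version suffix like v2.

```python
import re

# Matches "arXiv:" followed by either a new-style id (e.g. 1501.00730)
# or an old-style id (e.g. hep-th/9901001, math.GT/0309136).
# An optional version suffix such as "v2" is matched but not captured.
ARXIV_ID_RE = re.compile(
    r"arXiv:\s*(\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?"
)

def find_arxiv_id(text):
    """Return the first arXiv id that follows 'arXiv:' in text, or None."""
    match = ARXIV_ID_RE.search(text)
    return match.group(1) if match else None
```

Whether the stamp survives text extraction intact depends on the extraction tool, so it is worth handling a `None` result.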

Once you have the arXiv reference number (and have run pip install arxiv), you can get the title using

import arxiv

# arxiv.query is the interface of older releases of the arxiv package
paper_ref = '1501.00730'
arxiv.query(id_list=[paper_ref])[0].title
mathandy answered Oct 18 '22 03:10