I want to write a script that automatically renames downloaded papers to their titles. Is there a library or trick I can make use of? The PDFs are all generated by TeX, so they should have some 'formal' structure.
How to view PDF metadata? Open the PDF document in Adobe Acrobat and go to File > Properties > Description. The window that opens lists the document's metadata fields, including the title.
To capture text from a scanned image, upload the image file from your computer or take a screenshot on your desktop. Then right-click the image and select Grab Text. The text from your scanned PDF can then be copied and pasted into other programs and applications.
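You could try pyPdf, which can read the title stored in the PDF's metadata, as in this example:
from pyPdf import PdfFileReader

def get_pdf_title(pdf_file_path):
    # PDFs are binary files, so open in 'rb' mode.
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        # The title comes from the document info dictionary;
        # it may be empty if the PDF has no title set.
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')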
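Once you have a title, the renaming step itself needs only the standard library. This is a minimal sketch, not a definitive implementation: the helper names title_to_filename and rename_to_title, the 120-character cap, and the set of stripped characters are all assumptions chosen for illustration.

```python
import re
from pathlib import Path


def title_to_filename(title, max_length=120):
    """Turn a paper title into a conservative, filesystem-safe file stem."""
    # Collapse runs of whitespace (metadata titles may contain line breaks).
    cleaned = " ".join(title.split())
    # Drop characters that are invalid on common filesystems.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", cleaned)
    # Cap the length, then trim leading/trailing dots and spaces.
    return cleaned[:max_length].strip(" .")


def rename_to_title(pdf_path, title):
    """Rename pdf_path to '<title>.pdf' in the same directory.

    Returns the new path, or the original path if nothing was renamed.
    """
    pdf_path = Path(pdf_path)
    stem = title_to_filename(title or "")
    if not stem:
        # No usable title: leave the file alone.
        return pdf_path
    target = pdf_path.with_name(stem + ".pdf")
    if target.exists():
        # Never overwrite an existing file.
        return pdf_path
    pdf_path.rename(target)
    return target
```

The function deliberately does nothing when the title is empty or the target name is already taken, since metadata titles in TeX-generated PDFs are often missing.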
Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
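The search for "arXiv:" could look something like the sketch below. It assumes you have already extracted the PDF's text into a string by some other means, and that the id follows one of the two published arXiv formats (new style such as 1501.00730, old style such as hep-th/9901001). The name find_arxiv_id and the keep_version option are made up for this example.

```python
import re

# New-style ids (YYMM.NNNN or YYMM.NNNNN) or old-style ids (archive/YYMMNNN),
# each optionally followed by a version suffix such as "v2".
ARXIV_ID = re.compile(
    r"arXiv:\s*((?:\d{4}\.\d{4,5}|[a-z-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)"
)


def find_arxiv_id(text, keep_version=False):
    """Return the first arXiv id found in text, or None if there is none."""
    match = ARXIV_ID.search(text)
    if match is None:
        return None
    arxiv_id = match.group(1)
    if not keep_version:
        # Strip a trailing version suffix like "v2".
        arxiv_id = re.sub(r"v\d+$", "", arxiv_id)
    return arxiv_id
```

For instance, find_arxiv_id("arXiv:1501.00730v2 [cs.LG]") would give '1501.00730'.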
Once you have the arXiv reference number (and have done a pip install arxiv), you can get the title using
paper_ref = '1501.00730'
arxiv.query(id_list=[paper_ref])[0].title
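If you would rather not depend on the arxiv package, the same lookup can be sketched with the standard library against the public arXiv API, which returns an Atom feed. This is only an illustration under stated assumptions: the function names title_from_atom and fetch_arxiv_title are invented here, the timeout is arbitrary, and error handling for network failures is left out.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
API_URL = "http://export.arxiv.org/api/query"


def title_from_atom(feed_xml):
    """Return the title of the first entry in an Atom feed, or None."""
    root = ET.fromstring(feed_xml)
    title = root.find("atom:entry/atom:title", ATOM_NS)
    if title is None or title.text is None:
        return None
    # Titles in the feed can be wrapped over several lines.
    return " ".join(title.text.split())


def fetch_arxiv_title(paper_ref, timeout=10):
    """Query the arXiv API for paper_ref and return its title, or None."""
    query = urllib.parse.urlencode({"id_list": paper_ref})
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=timeout) as response:
        return title_from_atom(response.read())
```

Calling fetch_arxiv_title('1501.00730') requires network access; title_from_atom can be tried offline on any saved API response.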