Extracting titles from PDF files?

I want to write a script that automatically renames downloaded papers to their titles. Is there a library or trick I can use? The PDFs are all generated by TeX, so they should have a fairly regular structure.

562

asked May 26 '09 16:05

ZelluX

2 Answers

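You could try pyPdf, which can read the title stored in a PDF's document information (metadata). This only works when the tool that produced the PDF filled in that field.

For example: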

from pyPdf import PdfFileReader

def get_pdf_title(pdf_file_path):
    # PDFs are binary files, so open them in 'rb' mode.
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PdfFileReader(f)
        # The title may be None if the PDF carries no title metadata.
        return pdf_reader.getDocumentInfo().title

title = get_pdf_title('/home/user/Desktop/my.pdf')
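To get from a title to the renamed file the question asks about, something like the following might work. It is only a sketch: `title_to_filename` and `rename_to_title` are hypothetical helper names, the set of characters stripped is a conservative guess, and the title lookup is passed in as a function so any extractor can be plugged in.

```python
import re
from pathlib import Path

def title_to_filename(title, max_len=120):
    """Turn a paper title into a conservative, filesystem-safe file stem."""
    # Replace characters that are illegal or awkward in file names.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', ' ', title)
    # Collapse whitespace runs and trim stray spaces and dots.
    cleaned = re.sub(r'\s+', ' ', cleaned).strip(' .')
    return cleaned[:max_len].rstrip(' .')

def rename_to_title(pdf_path, get_title):
    """Rename pdf_path to '<title>.pdf'; return the resulting path.

    get_title is any callable mapping a path string to a title (or None).
    The file is left untouched if no usable title is found or if the
    target name already exists.
    """
    pdf_path = Path(pdf_path)
    title = get_title(str(pdf_path))
    stem = title_to_filename(title) if title else ''
    if not stem:
        return pdf_path
    target = pdf_path.with_name(stem + '.pdf')
    if target.exists():
        return pdf_path
    pdf_path.rename(target)
    return target
```

With the function above, a call could look like `rename_to_title('/home/user/Desktop/my.pdf', get_pdf_title)`.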

192

answered Oct 18 '22 05:10

schnaader

Assuming all these papers are from arXiv, you could instead extract the arXiv id (I'd guess that searching for "arXiv:" in the PDF's text would consistently reveal the id as the first hit).
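Assuming you already have the PDF's text as a string (from whatever text extractor you prefer), the search for the id might be sketched like this. `find_arxiv_id` is a hypothetical helper, and the pattern only covers new-style ids such as 1501.00730, not old-style ones like hep-th/9901001.

```python
import re

# New-style arXiv identifiers look like 1501.00730, optionally followed by a
# version suffix such as v2. Old-style ids (e.g. hep-th/9901001) are not handled.
ARXIV_ID = re.compile(r'arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?')

def find_arxiv_id(text):
    """Return the first new-style arXiv id found in text, or None."""
    match = ARXIV_ID.search(text)
    return match.group(1) if match else None
```

The version suffix is dropped on purpose, so the result can be used directly as the reference number.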

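Once you have the arXiv reference number (and have run pip install arxiv), you can get the title as follows. Note that query is the interface of older releases of the arxiv package, so newer releases may need a different call: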

paper_ref = '1501.00730'
arxiv.query(id_list=[paper_ref])[0].title

answered Oct 18 '22 03:10

mathandy
