Extracting information from PDFs of research papers [closed]

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it.

At the very least, I need the title and abstract. The list of authors and their affiliations would be good. Extracting the references would be amazing.

Ideally this would be an open source solution.

The problem is that not all PDFs encode the text, and many that do fail to preserve its logical order, so just running pdf2text gives you line 1 of column 1, line 1 of column 2, line 2 of column 1, etc.
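To make the column-ordering problem concrete, here is a minimal Python sketch of one way to undo the interleaving. It assumes an extractor has already produced one (x, y, text) tuple per line, with y growing downward, and that the page has exactly two columns split at the page midpoint. The `Line` type, the `reorder_two_columns` function, and the coordinates are all hypothetical and do not come from any particular library.

```python
from typing import List, Tuple

# (x, y, text): left edge and top edge of a line's bounding box.
# Assumption: y grows downward, as in many PDF text extractors.
Line = Tuple[float, float, str]


def reorder_two_columns(lines: List[Line], page_width: float) -> List[str]:
    """Return line texts in reading order for a simple two-column page."""
    middle = page_width / 2
    # Assign each line to a column by where its left edge falls.
    left = [ln for ln in lines if ln[0] < middle]
    right = [ln for ln in lines if ln[0] >= middle]
    # Read the whole left column top to bottom, then the right column.
    ordered = sorted(left, key=lambda ln: ln[1]) + sorted(right, key=lambda ln: ln[1])
    return [text for _, _, text in ordered]


# Interleaved input, in the order a naive extractor might emit it.
interleaved = [
    (50, 100, "col 1, line 1"),
    (320, 100, "col 2, line 1"),
    (50, 115, "col 1, line 2"),
    (320, 115, "col 2, line 2"),
]
print(reorder_two_columns(interleaved, page_width=600))
```

This only handles the simplest layout. A title or abstract that spans the full page width would be filed under the left column, so a real pipeline would need to detect such blocks first.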

I know there are a lot of libraries. What I need to solve is identifying the abstract, title, authors, etc. in the document. This is never going to be possible every time, but 80% would save a lot of human effort.
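As an illustration of the kind of identification step meant here, the following is a naive Python heuristic, not a method proposed in this thread. It assumes the text has already been extracted in reading order and that the paper labels its abstract with a line starting with "Abstract". The function name `guess_abstract` and the list of terminating headings are assumptions chosen for the example.

```python
import re
from typing import Optional


def guess_abstract(text: str) -> Optional[str]:
    """Return the text between an 'Abstract' heading and the next likely heading."""
    start = re.search(r"^\s*abstract\b[\s:.\-]*", text, re.IGNORECASE | re.MULTILINE)
    if start is None:
        return None  # No labelled abstract found.
    rest = text[start.end():]
    # Stop at the first heading that usually follows an abstract.
    end = re.search(
        r"^\s*(?:\d+\.?\s*)?(?:introduction|keywords|index terms)\b",
        rest,
        re.IGNORECASE | re.MULTILINE,
    )
    # If no terminating heading is found, everything after 'Abstract' is kept.
    body = rest[: end.start()] if end else rest
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(body.split()) or None


sample = (
    "A Great Paper\nJane Doe\n\nAbstract\nWe study things.\nThey work.\n\n"
    "1. Introduction\nText continues here."
)
print(guess_abstract(sample))
```

Papers without a labelled abstract, or with unusual section names, will defeat this, which is consistent with the expectation that no approach works every time.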

639

asked Nov 28 '09 19:11

Christopher Gutteridge

1 Answer

I'm only allowed one link per posting, so this is it: pdfinfo Linux manual page

This might get the title and authors. At the bottom of the manual page there's a link to www.foolabs.com/xpdf, where the open source code for the program can be found, along with binaries for various platforms.
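A minimal Python sketch of calling pdfinfo from a script might look like the following. It assumes the pdfinfo binary is installed and on the PATH, and that its output is a series of "Key: value" lines. The helper names `parse_pdfinfo_output` and `pdf_metadata` are made up for this example.

```python
import subprocess
from typing import Dict


def parse_pdfinfo_output(output: str) -> Dict[str, str]:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pdf_metadata(path: str) -> Dict[str, str]:
    """Run pdfinfo on a file and return its fields (requires pdfinfo on PATH)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo_output(result.stdout)
```

Calling `pdf_metadata("paper.pdf").get("Title")` would then return the title if the PDF's document information carries one. Many research papers leave these fields empty or filled with placeholder values, which is why this only "might" get the title and authors.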

To pull out bibliographic references, look at cb2Bib:

cb2Bib is a free, open source, and multiplatform application for rapidly extracting unformatted, or unstandardized bibliographic references from email alerts, journal Web pages, and PDF files.

You might also want to check the discussion forums at www.zotero.org where this topic has been discussed.

110

answered Sep 29 '22 23:09

MZB

Tags: pdf, metadata, extraction
