
Programmatically search multiple PDF files for keyword and note page number

Tags: search, pdf

I work for a museum that has hundreds of scientific papers stored as PDFs in a single directory. I have OCR'd all of them so that they can be searched for keywords in programs like Adobe Reader. I need to write a program that searches this directory for a specific species name and generates a list of the documents that contain it, along with the page numbers where it appears.

I am looking for a PDF library, ideally a free one, that I can use for this task. I wrote a small program using the PDFOne library, but searching the directory for a single term took about 10 minutes. I would like to cut that time down significantly, since Adobe Reader and PDF-XChange Viewer can perform the same search in under a minute. I have no preference regarding programming language.

Can anyone direct me to the right resources so I may accomplish this task? Thanks.

asked Sep 11 '13 10:09 by Alex Vizzone




1 Answer

I suggest evaluating Apache Solr, which can index PDF files very efficiently.

http://lucene.apache.org/solr/
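
To illustrate the index-once, search-many idea behind that suggestion, here is a minimal sketch in plain Python. It is not Solr and does not use Solr's API. It assumes the third-party pypdf library for text extraction (my choice, not something from the question), and the helper names `extract_pages`, `build_index` and `search_index` are hypothetical. The slow step (reading every PDF) happens once and is cached as JSON, and each later keyword search only scans the cached page text.

```python
import json
from pathlib import Path


def extract_pages(pdf_path):
    """Return a list with the text of each page (list index 0 is page 1)."""
    # Imported here so the search helper below works without pypdf installed.
    from pypdf import PdfReader  # assumption: third-party, `pip install pypdf`

    reader = PdfReader(str(pdf_path))
    # Fall back to "" in case a page yields no text at all.
    return [page.extract_text() or "" for page in reader.pages]


def build_index(pdf_dir, index_file):
    """Extract every PDF in pdf_dir once and cache the page text as JSON."""
    index = {
        pdf.name: extract_pages(pdf)
        for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
    }
    Path(index_file).write_text(json.dumps(index), encoding="utf-8")
    return index


def search_index(index, keyword):
    """Return {filename: [1-based page numbers]} for pages containing keyword."""
    # Lowercase and collapse whitespace so the match ignores case and
    # still works when the name is split across a line break.
    needle = " ".join(keyword.lower().split())
    hits = {}
    for name, pages in index.items():
        matched = [
            number
            for number, text in enumerate(pages, start=1)
            if needle in " ".join(text.lower().split())
        ]
        if matched:
            hits[name] = matched
    return hits
```

To use it, you would call `build_index("papers", "index.json")` once (the directory and file names are placeholders), reload the cache on later runs with `json.loads(Path("index.json").read_text(encoding="utf-8"))`, and pass the resulting dictionary to `search_index` together with the species name. Plain substring matching is only as good as the OCR text layer, so misrecognized characters will cause misses. A real search engine such as Solr adds tokenization and ranking on top of the same basic idea.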

answered Nov 15 '22 12:11 by tbsalling