I work for a museum that has hundreds of scientific-paper PDFs sitting in a directory. I have OCR'd all of them so that they can be searched for keywords in programs like Adobe Reader. I need to write a program that searches this directory for a specific species name and generates a list of the documents that match, along with the page number of each match.
I am looking for a PDF library, preferably free, that I can use for this task. I wrote a small program using the PDFOne library, but it took about 10 minutes to search the directory for a single term. I would like to cut that time down significantly, since Adobe Reader and PDF-XChange Viewer can perform the same search in under a minute. I have no preference as to programming language.
Can anyone direct me to the right resources for this task? Thanks.
When a PDF is opened in Acrobat Reader (not in a browser), the search pane may or may not be displayed. To display the search/find pane, press Ctrl+F.
First, import the PyPDF2 library with import PyPDF2 as pdf, and pay attention to the capitalization, because module names are case-sensitive. Then import os and list the contents of the folder with os.listdir('the path'), assigning the result to a variable, e.g. path = os.listdir('the path').
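To make those steps concrete, here is a rough sketch of how the whole search could look. It assumes PyPDF2 2.x or later (where PdfReader, reader.pages and page.extract_text() are available) and that the OCR text layer is extractable; the function names search_directory, extract_page_texts and find_term_in_pages are my own, not part of any library.

```python
import os


def find_term_in_pages(page_texts, term):
    """Return the 1-based page numbers whose text contains `term`.

    Matching is case-insensitive, and runs of whitespace are collapsed so a
    species name that OCR split across a line break is still found.
    """
    needle = " ".join(term.lower().split())
    hits = []
    for page_no, text in enumerate(page_texts, start=1):
        haystack = " ".join((text or "").lower().split())
        if needle in haystack:
            hits.append(page_no)
    return hits


def extract_page_texts(pdf_path):
    """Return one text string per page (assumes PyPDF2 2.x or later)."""
    import PyPDF2  # imported here so the search logic works without it

    reader = PyPDF2.PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


def search_directory(folder, term, extract=extract_page_texts):
    """Return a list of (file name, page number) pairs for every match."""
    results = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        pages = extract(os.path.join(folder, name))
        for page_no in find_term_in_pages(pages, term):
            results.append((name, page_no))
    return results
```

You would call it with something like search_directory('the path', 'Panthera leo') and print the pairs it returns. Note that this still re-reads every PDF on every search, so it may not be any faster than what you already have.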
I suggest that you evaluate the use of Apache Solr - which can index PDF files very efficiently.
http://lucene.apache.org/solr/
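The reason an index helps is that the PDFs are parsed once, and each later search is just a lookup. The toy sketch below illustrates that idea only; it is not Solr and does not use Solr's API. The names tokenize, build_index and lookup are made up for the illustration, and the page texts are assumed to have been extracted already.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of (file name, page number) where it occurs.

    `documents` maps a file name to its list of page texts.
    """
    index = defaultdict(set)
    for name, pages in documents.items():
        for page_no, text in enumerate(pages, start=1):
            for word in tokenize(text):
                index[word].add((name, page_no))
    return index


def lookup(index, phrase):
    """Return sorted (file name, page number) pairs containing every word.

    This only checks that all words appear on the page, not that they are
    adjacent; a real search engine also handles phrase queries.
    """
    words = tokenize(phrase)
    if not words:
        return []
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))
```

Building the index is the slow part and only needs to be redone when PDFs are added or changed; after that, each species-name lookup touches the index rather than the files.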