Programmatically search multiple PDF files for keyword and note page number

Tags: search, pdf

I work for a museum that has hundreds of scientific papers stored as PDFs in a single directory. I have OCR'd all of them so that they can be searched for keywords in programs like Adobe Reader. I need to write a program that searches this directory for a specific species name and generates a list of the matching documents, along with the page numbers where the name appears.

I am looking for a PDF library, preferably free, that I can use for this task. I wrote a small program using the PDFOne library, but it took about 10 minutes to search the directory for a single term. I would like to cut that time down significantly, since Adobe Reader and PDF-XChange Viewer can perform the same search in under a minute. I have no preference on which language to use.

Can anyone direct me to the right resources so I may accomplish this task? Thanks.

asked Sep 11 '13 10:09

Alex Vizzone

1 Answer

I suggest that you evaluate Apache Solr, which can index PDF files very efficiently.

http://lucene.apache.org/solr/
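
To illustrate why an index helps, here is a minimal sketch of the general idea in plain Python. It is not Solr and does not use Solr's API: it builds a small in-memory inverted index that maps each word to the (document, page) pairs where it occurs, so the text is scanned once and each later search is a lookup. It assumes the text of every page has already been extracted by a PDF library of your choice; the names `pages_by_doc`, `build_index`, and `search` and the sample file names are hypothetical. Note that it matches pages containing every word of the query, not necessarily as an adjacent phrase.

```python
import re
from collections import defaultdict


def build_index(pages_by_doc):
    """Map each lowercase word to the set of (document, page_number) pairs containing it.

    pages_by_doc is assumed to be {filename: [text of page 1, text of page 2, ...]},
    produced beforehand by whatever PDF text extractor you use.
    """
    index = defaultdict(set)
    for doc, pages in pages_by_doc.items():
        for page_number, text in enumerate(pages, start=1):
            for word in re.findall(r"[a-z]+", text.lower()):
                index[word].add((doc, page_number))
    return index


def search(index, query):
    """Return sorted (document, page_number) pairs whose page contains every query word."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return []
    # Intersect the posting sets so only pages with all words remain.
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


# Made-up sample data standing in for extracted page text.
pages_by_doc = {
    "paper_a.pdf": ["Introduction to the survey.", "We observed Danaus plexippus in June."],
    "paper_b.pdf": ["Danaus plexippus migration routes.", "References."],
}

index = build_index(pages_by_doc)
for doc, page in search(index, "Danaus plexippus"):
    print(f"{doc}: page {page}")
```

Building the index is the slow step and only needs to be repeated when files are added or changed. A search server such as Solr additionally stores the index on disk and handles phrase queries, which this sketch does not attempt.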

answered Nov 15 '22 12:11

tbsalling
