
Searching (extracting text) PDF files with Algolia

This is just a speculative idea for a client who has a lot of PDF files.

Algolia say in their FAQs that to search PDF files you first need to extract the text from the file. How would you go about this?

The way I envisage the system working would be:

  • Client uploads PDF via CMS
  • CMS calls some service / program to extract the text
  • Algolia indexes the extracted text, which is somehow linked to the original PDF

It would need to be an automated system, as the client shouldn't have to trigger the indexing manually. It would be built in PHP, probably Laravel running on Ubuntu.

What software / service could do the text extraction from the PDFs, and is any magic needed to 'link' the extracted text with the PDF file?

I'm also happy to have suggestions on other search services which may handle this.

Ric asked Jan 23 '26 17:01


2 Answers

Fortunately, text extraction from PDFs is a subject that has been covered multiple times. On the command line, you could use pdftotext (available on Linux or Mac), or in your code a library such as Apache Tika (for which you can find a PHP wrapper).
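As a rough illustration of the command-line route, here is a minimal Python sketch that shells out to pdftotext. It assumes the pdftotext binary (from poppler-utils on Ubuntu) is installed and on the PATH. The function names pdftotext_command and extract_text are made up for this example, and the same idea carries over to PHP with its own process functions.

```python
from __future__ import annotations

import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the pdftotext invocation for one PDF (hypothetical helper)."""
    # Passing "-" as the output file makes pdftotext write to stdout
    # instead of creating a .txt file next to the PDF.
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text (hypothetical helper).

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The CMS upload handler would call something like extract_text on the stored file right after the upload succeeds, ideally from a queued job so a large PDF does not block the request.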

To avoid having too much noise in your records, I'd recommend splitting the text and creating one record per paragraph. You can then use Algolia's distinct feature to deduplicate the results, so that each PDF shows up only once even when several of its paragraphs match.

You should already have the links to your files somewhere. Just store them in your records; in your front-end you'll then easily be able to create links to the files, using for instance autocomplete.js or instantsearch.js.
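The splitting and linking described above could look roughly like the following sketch. The field names doc_id, title, url and content and the helper names are assumptions chosen for illustration. objectID is Algolia's record identifier, and attributeForDistinct and distinct are Algolia index settings. Sending the records and settings through an Algolia API client is deliberately left out.

```python
from __future__ import annotations

import re


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text into paragraphs on blank lines (hypothetical helper)."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    parts = re.split(r"\n\s*\n", text)
    # Collapse line wraps and repeated spaces inside each paragraph.
    return [" ".join(part.split()) for part in parts if part.strip()]


def build_records(doc_id: str, title: str, pdf_url: str, text: str) -> list[dict]:
    """Create one record per paragraph, each carrying the link to its PDF."""
    return [
        {
            "objectID": f"{doc_id}-{index}",  # unique per paragraph
            "doc_id": doc_id,                 # shared by all paragraphs of one PDF
            "title": title,
            "url": pdf_url,                   # what the front-end links to
            "content": paragraph,
        }
        for index, paragraph in enumerate(split_paragraphs(text))
    ]


# Index settings that deduplicate hits per PDF. The attribute names in
# searchableAttributes and attributeForDistinct match the assumed fields above.
INDEX_SETTINGS = {
    "searchableAttributes": ["title", "content"],
    "attributeForDistinct": "doc_id",
    "distinct": True,
}
```

Because every record holds the url of its source file, no extra lookup is needed at search time: the front-end can render the link straight from the hit.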

Jerska answered Jan 26 '26 09:01


For anyone still looking for a solution, I put together a GitHub repository that does exactly that: https://github.com/PDFTron/pdftron-document-search.

The sample app is built with React, Firebase and Algolia, and the text extraction happens client-side as the user uploads the document.

You can check out a quick video walking you through the sample app: https://youtu.be/IQATnzHTp7Q.

Let me know if you have any questions.

Andrey answered Jan 26 '26 08:01


