Alternative to Tika/PDFBox for parsing PDF in Solr (any version later than 1.4)

Question

Solr does not seem to be parsing my PDF files correctly. Is there an alternative to Apache Tika (which I believe uses PDFBox internally) for parsing PDF files? I am getting random spaces in the middle of the extracted content. I have isolated the problem by running the PDFs through PDFBox directly (latest version), which shows the same problem.

Some commercial OCR software, such as Omnifind, handles these PDFs fine, but we cannot integrate it with Solr in the same way, and buying it is not an option either.

Tom De Leu · Accepted Answer

As the answer to this SO question indicates, this is due to the nature of the PDF format itself.
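To illustrate what "the nature of the PDF format" means here: a PDF typically stores glyphs with page coordinates rather than words separated by space characters, so an extractor has to guess where the spaces go from the gaps between glyphs. The following is only a toy sketch of that guessing step, not PDFBox's actual algorithm; the `join_glyphs` function, the `gap_ratio` parameter, and the glyph coordinates are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x_start, width) in page units.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild one line of text from positioned glyphs.

    A space is emitted whenever the horizontal gap between two
    neighbouring glyphs exceeds gap_ratio times the average glyph width.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g[1])  # left to right
    avg_width = sum(g[2] for g in ordered) / len(ordered)
    out = [ordered[0][0]]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - (prev[1] + prev[2])  # distance from prev's right edge
        if gap > gap_ratio * avg_width:
            out.append(" ")  # the extractor guesses a word break here
        out.append(cur[0])
    return "".join(out)


# The word "Solr" set with loose spacing between "o" and "l".
# There is no space character anywhere in this data.
line = [("S", 0.0, 6.0), ("o", 6.0, 6.0), ("l", 14.5, 3.0), ("r", 17.5, 4.0)]

print(join_glyphs(line))                 # strict threshold: spurious space
print(join_glyphs(line, gap_ratio=0.6))  # looser threshold: word stays whole
```

With the strict threshold the loosely spaced word comes out as "So lr", while the looser threshold keeps it as "Solr". No single threshold is right for every font and layout, which is why the stray spaces look random and why swapping one extractor for another does not necessarily remove them.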

It's possible that OCR options do better on this problem than PDFBox. There are some free OCR options available, such as Tesseract and Ocropus, but I have no idea how well they work or whether they can be easily integrated with Solr.


Tags: solr, apache-tika, pdfbox, full-text-indexing, document-conversion

Ravish Bhagdev
