
Is there a way to disable OCR mode in Tika without uninstalling Tesseract?

I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? There are two constraints:

1. Tesseract cannot be uninstalled.

2. tika.xml can't be edited, because tika-app.jar is used off the shelf.

Is there a way to disable OCR from the Java code, by setting something on the parse context or a parser property?

I tried the code below, but OCR still extracts text from image files during parsing.

    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setOcrStrategy(OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
Santhosh asked Sep 13 '25 00:09


1 Answer

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>
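
The configuration above tells `DefaultParser` to load every parser except `TesseractOCRParser`, so image files are no longer sent to Tesseract. The `PDFParserConfig` setting in the question only governs how PDFs are handled, which is probably why images were still being OCR'd. The config does not have to replace any file that ships with Tika: tika-app accepts a separate file through its `--config=` option. Below is a minimal sketch of that idea, assuming the config is generated at runtime. The class name `NoOcrTikaConfig`, the helper names, and the paths `tika-app.jar` and `scan.png` are placeholders for illustration, not part of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class NoOcrTikaConfig {

    // Same configuration as above, kept as a string so that no file
    // shipped with Tika has to be edited.
    static final String CONFIG_XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "    <parsers>\n"
        + "        <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "            <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "        </parser>\n"
        + "    </parsers>\n"
        + "</properties>\n";

    // Writes the config to a temporary file and returns its path.
    public static Path writeConfig() throws IOException {
        Path config = Files.createTempFile("tika-no-ocr-", ".xml");
        Files.write(config, CONFIG_XML.getBytes(StandardCharsets.UTF_8));
        return config;
    }

    // Builds the tika-app command line; jar and input paths are supplied by the caller.
    public static List<String> buildCommand(String tikaAppJar, Path config, String input) {
        return Arrays.asList(
            "java", "-jar", tikaAppJar,
            "--config=" + config,   // point tika-app at the generated config
            "--text", input);       // plain-text output for the given file
    }

    public static void main(String[] args) throws IOException {
        Path config = writeConfig();
        // Placeholder paths: replace with the real jar and input file.
        List<String> command = buildCommand("tika-app.jar", config, "scan.png");
        System.out.println(String.join(" ", command));
    }
}
```

The printed command can be run from a shell or handed to `ProcessBuilder`. If Tika is embedded as a library instead, the same XML can likely be loaded with `new TikaConfig(inputStream)` and passed to `new AutoDetectParser(tikaConfig)`; that variant is not exercised here.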
suraj huljute answered Sep 14 '25 15:09