Is there a way to disable OCR mode in Tika without uninstalling Tesseract?

Question

I am using the tika-app jar in my project. Is there a way to disable Tesseract OCR in Tika? Two constraints have to be respected:

1. Tesseract cannot be uninstalled.

2. tika.xml can't be edited, as tika-app.jar is used off the shelf.

Is there a way to disable OCR from the Java code, by setting a property on the parse context or the parser?

I tried the code below, but OCR still extracts text from image files during parsing.

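    // Note: PDFParserConfig only governs OCR inside PDFs, not standalone image files.
    ParseContext context = new ParseContext();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);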

suraj huljute · Accepted Answer

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>
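
Since the jar's bundled configuration can't be edited, the XML above has to be supplied from outside the jar. One possible way, sketched here, is to write it to a temporary file and point the `tika.config` system property at it, which to my understanding the default `TikaConfig` reads when it is first created. The class and method names (`OcrFreeTikaConfig`, `writeConfig`, `installAsDefault`) are hypothetical and not part of Tika; only JDK classes are used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: the class and method names are illustrative, not Tika API.
class OcrFreeTikaConfig {

    // Same configuration as in the answer: DefaultParser minus TesseractOCRParser.
    static final String CONFIG_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          + "<properties>\n"
          + "    <parsers>\n"
          + "        <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
          + "            <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
          + "        </parser>\n"
          + "    </parsers>\n"
          + "</properties>\n";

    // Writes the configuration to a temporary file and returns its path.
    static Path writeConfig() throws IOException {
        Path file = Files.createTempFile("tika-config-no-ocr", ".xml");
        Files.write(file, CONFIG_XML.getBytes(StandardCharsets.UTF_8));
        file.toFile().deleteOnExit();
        return file;
    }

    // Assumption: Tika's default configuration honours the "tika.config"
    // system property, so this must run before any Tika parser is created.
    static Path installAsDefault() throws IOException {
        Path file = writeConfig();
        System.setProperty("tika.config", file.toAbsolutePath().toString());
        return file;
    }

    public static void main(String[] args) throws IOException {
        installAsDefault();
        System.out.println("tika.config = " + System.getProperty("tika.config"));
    }
}
```

Call `OcrFreeTikaConfig.installAsDefault()` before the first parser is constructed. When tika-app is run from the command line instead, the same file can be passed with `java -jar tika-app.jar --config=<path to the file>`; neither route touches the jar or the Tesseract installation.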

Tags:

java

ocr

tesseract

apache-tika

Santhosh
