Apache Tika extract scanned PDF files

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, meaning each page is only an image. My goal is to extract the text from these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):

public String extractText(InputStream stream) {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    String text = handler.toString();
    return text;
}

I searched a lot but didn't find a solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but it didn't change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a DOC file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

asked Sep 02 '15 13:09

LorisBachert

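1 Answer

Tim Allison provided the solution. Besides enabling inline image extraction, the parser has to be registered in the ParseContext so that the images embedded in the PDF are parsed recursively: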

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)

EDIT: Here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the file even if parsing throws
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}
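
If the program above prints metadata but no OCR text, one thing worth ruling out is that the `tesseract` binary is not visible to the JVM. As far as I know, Tika's OCR parser calls the command-line executable and quietly skips OCR when it can't find it. Below is a minimal JDK-only sketch of such a check. The class `TesseractCheck` and the helper `findOnPath` are made up for illustration and are not part of Tika, and the sketch assumes the binary is supposed to be found via PATH rather than via a path set on TesseractOCRConfig.

```java
import java.io.File;

class TesseractCheck {

    /**
     * Hypothetical helper: returns the first executable file named exeName
     * found in the directories of a PATH-style string, or null if none is found.
     */
    static File findOnPath(String exeName, String pathValue) {
        if (pathValue == null) {
            return null;
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, exeName);
            if (candidate.isFile() && candidate.canExecute()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: the binary is called tesseract.exe on Windows, tesseract elsewhere
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        String exe = windows ? "tesseract.exe" : "tesseract";

        File found = findOnPath(exe, System.getenv("PATH"));
        System.out.println(found == null
                ? "tesseract not found on PATH"
                : "tesseract found at " + found.getAbsolutePath());
    }
}
```

Run it with the same user and environment as the Tika application, since PATH can differ between a shell and, say, an application server.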

Maven Dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>
</dependencies>
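
The answer doesn't say what the second dependency is for. My understanding is that scanned PDFs often store their page images with JBIG2 compression, and that PDFBox (which Tika uses for PDFs) decodes those through an ImageIO plugin such as this one. Here is a small JDK-only sketch to see whether such a reader is registered on the classpath. `ImageReaderCheck` and `hasReaderFor` are illustrative names, and "JBIG2" is the format name I'd expect the plugin to register under, so treat that as an assumption.

```java
import javax.imageio.ImageIO;

class ImageReaderCheck {

    /** Hypothetical helper: true if ImageIO has at least one reader for the given format name. */
    static boolean hasReaderFor(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // PNG support ships with the JDK, so this line is a sanity check
        System.out.println("PNG reader available:   " + hasReaderFor("png"));
        // Assumption: the JBIG2 plugin registers itself under the name "JBIG2"
        System.out.println("JBIG2 reader available: " + hasReaderFor("JBIG2"));
    }
}
```

If the second line reports false with the dependency on the classpath, the format name assumption is the first thing to double-check.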

answered Nov 15 '22 14:11

LorisBachert

Tags: java, pdf, ocr, tesseract, apache-tika
