Tika AutoDetectParser returning empty string?

I'm trying to use Tika's AutoDetectParser to extract a file's content. I originally thought this was a dependency issue, but I cannot see how that could still be the case now that I'm including all of tika-app in my jar.

AutoDetectParser returns an empty string here:

// Expected to print the PDF's text, but prints an empty string
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
// try-with-resources closes the stream even if parse() throws
try (FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"))) {
    parser.parse(mypdfstream, handler, metadata, context);
}
System.out.println(handler.toString());

What confuses me further is that using PDFParser directly works fine:

// Same file, same handler, but with the PDF parser chosen explicitly
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfparser = new PDFParser();
// try-with-resources closes the stream even if parse() throws
try (FileInputStream mypdfstream = new FileInputStream(new File("mypdf.pdf"))) {
    pdfparser.parse(mypdfstream, handler, metadata, context);
}
System.out.println(handler.toString());

I have put both the tika-app and tika-parsers jars on my classpath and merged them into the jar created by Ant.

Relevant portions of build.xml:

<javac srcdir="${src}" destdir="${build}">
        <classpath>
                <pathelement path="lib/tika-app-1.11.jar"/>
                <pathelement path="lib/tika-parsers-1.11.jar"/>
        </classpath>
</javac>

<jar jarfile="${dist}/lib/MyProject-${DSTAMP}.jar" basedir="${build}">
        <zipgroupfileset dir="lib" includes="tika-app-1.11.jar"/>
        <zipgroupfileset dir="lib" includes="tika-parsers-1.11.jar"/>
</jar>

Edit: I checked the supported types with parser.getSupportedTypes(context) and the set was empty, as is the collection of parsers returned from parser.getParsers().

So perhaps this is yet another dependency issue? This truly surprises me given tika-app is included.
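One way to narrow this down: Tika discovers its parsers through service provider files named META-INF/services/org.apache.tika.parser.Parser, so an empty parser list suggests that no such file is visible on the runtime classpath. The sketch below uses only the standard library to list every copy of that file the class loader can see. The class name ServiceFileCheck and the method findServiceFiles are made up for illustration; this is a rough diagnostic and it does not prove what the cause is in this particular build.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Tika): lists the service provider files
// for a given interface that are visible on the current classpath.
class ServiceFileCheck {

    static List<URL> findServiceFiles(String serviceInterface) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // Each jar (or directory) that carries the file yields one URL.
            return Collections.list(
                    loader.getResources("META-INF/services/" + serviceInterface));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The service file name Tika uses to register its parsers.
        List<URL> found = findServiceFiles("org.apache.tika.parser.Parser");
        if (found.isEmpty()) {
            System.out.println("No parser service file found on the classpath");
        } else {
            for (URL url : found) {
                System.out.println("Parser service file: " + url);
            }
        }
    }
}
```

Running this with the same classpath as the failing program (for example, with the merged MyProject jar) shows whether the service file survived packaging. If nothing is printed for the merged jar but the file shows up when tika-app-1.11.jar is used directly, the jar step is the place to look.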

Pat asked Dec 21 '15 20:12

1 Answer

I had the same issue. I fixed it by adding the Tika core and parsers dependencies to my pom.xml again, like this, and then updating the Maven project in Eclipse.

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.18</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.18</version>
    </dependency>

Leonardo Bouchan answered Oct 09 '22 02:10