
Python: how to use Tika with an existing jar file without downloading it again

I'm using Tika, and I noticed that the jar file is downloaded again on every run and placed in the Temp folder:

Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar to C:\Users\asus\AppData\Local\Temp\tika-server.jar.
Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.19/tika-server-1.19.jar.md5 to C:\Users\asus\AppData\Local\Temp\tika-server.jar.md5.

The problem is that the jar file is around 60 MB, so it takes a while to download.

This is the code I'm using:

from tika import parser

def get_pdf_text(path):
    parsed = parser.from_file(path)
    return parsed['content']

The only workaround I found is this:

1 - Manually run the jar with java -jar tika-server-x.x.jar --port xxxx

2 - Set tika.TikaClientOnly = True

3 - Replace parser.from_file(path) with parser.from_file(path, '/path/to/server')

But I don't want to run the jar file manually. It would be better if I could use Python to start the jar automatically and set up Tika with it, without downloading it again.
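For illustration, the three manual steps could be scripted roughly as below. This is only a sketch under assumptions: the helper names build_server_command and wait_for_port are made up for this example, port 9998 is an assumed default, java is assumed to be on the PATH, and an open TCP port is treated as "server ready", which may be slightly optimistic.

```python
import socket
import subprocess
import time


def build_server_command(jar_path, port):
    """Return the command from step 1 as an argument list."""
    return ["java", "-jar", jar_path, "--port", str(port)]


def wait_for_port(host, port, timeout=60.0):
    """Poll until something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False


def get_pdf_text(path, jar_path, port=9998):
    """Start a local Tika server from jar_path and parse one file with it."""
    # Imported here so the helpers above work without tika installed.
    import tika
    from tika import parser

    tika.TikaClientOnly = True  # step 2 from the question
    server = subprocess.Popen(build_server_command(jar_path, port))
    try:
        if not wait_for_port("localhost", port):
            raise RuntimeError("Tika server did not start in time")
        # Step 3: pass the server endpoint explicitly.
        parsed = parser.from_file(path, "http://localhost:%d" % port)
        return parsed["content"]
    finally:
        # Stop the server again once the file has been parsed.
        server.terminate()
        server.wait()
```

Starting and stopping the server for every file is slow, so for many files it would probably make more sense to start it once and reuse the same endpoint.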

Michael Fish asked Jun 12 '19 10:06



2 Answers

If you don't want to add an environment variable, you can change the directory in which tika looks for the tika-server.jar file with the code below.

from tika import tika
tika.TikaJarPath = r'TIKA_SERVER_PATH'

In that TIKA_SERVER_PATH folder, the jar file must be named tika-server.jar (the name must not include the version), and the .md5 file must be there as well. If the .md5 file does not match tika-server.jar, this method doesn't work: tika will delete your file and download the default version.
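Since a mismatched .md5 file makes tika delete the jar, it may be worth checking the folder first. The sketch below is not tika's own check; the function name jar_matches_md5 is made up for this example, and it assumes the .md5 file holds the plain hex MD5 digest of the jar. It ignores surrounding whitespace and letter case, which tika itself may not.

```python
import hashlib
import os


def jar_matches_md5(folder):
    """Return True if tika-server.jar in folder matches tika-server.jar.md5."""
    jar_path = os.path.join(folder, "tika-server.jar")
    md5_path = jar_path + ".md5"
    if not (os.path.isfile(jar_path) and os.path.isfile(md5_path)):
        return False
    # Hash the jar and compare with the digest stored next to it.
    with open(jar_path, "rb") as jar_file:
        actual = hashlib.md5(jar_file.read()).hexdigest()
    with open(md5_path, "r") as md5_file:
        expected = md5_file.read().strip().lower()
    return actual == expected


# Possible use, with the placeholder from the answer:
# if jar_matches_md5(r'TIKA_SERVER_PATH'):
#     from tika import tika
#     tika.TikaJarPath = r'TIKA_SERVER_PATH'
```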

bolbol answered Oct 20 '22 16:10


To resolve this problem, add an environment variable for the Tika server jar and set it to the path of the folder that contains the Tika jar file:

TIKA_SERVER_JAR = 'PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR'
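If setting the variable system-wide is inconvenient, it can probably be set from Python instead. The sketch below assumes that tika reads its environment variables when it is first imported, so the call has to come before any import of tika; set_tika_env is a made-up helper name, and the value is the placeholder from above. As far as I recall, the tika-python README also lists a TIKA_PATH variable for the directory that holds tika-server.jar, so that may be worth trying if this one has no effect.

```python
import os


def set_tika_env(name, value):
    """Set one environment variable for the current process and return it."""
    os.environ[name] = value
    return os.environ[name]


# Must run before the first `import tika` in the process (assumption:
# tika reads its settings from the environment at import time).
set_tika_env("TIKA_SERVER_JAR", "PATH_OF_FOLDER_CONTAINING_TIKA_SERVER_JAR")

# After this point tika can be imported as usual, for example:
# from tika import parser
```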

Rafikoo Saidoo answered Oct 20 '22 18:10