
How to use Tika via PHP when both installed on one server?

  • I need to make an internal website which allows users to upload .doc, .pdf, .xls files and see the text in a textarea box.
  • I have created the site in PHP to the point where a user can upload the files.
  • I have installed Tika on my server, and at the command line I can type java -jar tika-app-1.10-SNAPSHOT.jar -m manu.pdf > output.txt, which successfully creates the text I need in the output file.

What is the best way to call Tika from PHP in order to get the plain text of an uploaded file into PHP?

Searching around I find:

  1. PHP code that makes calls to a "Tika server" e.g. with cURL
  2. PHP Wrapper classes for Tika which seem to use Tika on the same server that PHP is installed on, but I have not gotten any of them to work.
  3. Alternatively, I could simply call Tika via the exec command.

But I'm not sure what is the easiest way to proceed.

Edward Tanguay asked Jun 04 '15 14:06


2 Answers

Simpler approach (call API)

To run Tika on a remote server, I suggest using cURL or Guzzle to call its address. You could also simply use file_get_contents and pass it the URL of the API that calls Tika on the remote server.
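
A minimal sketch of the remote call, using file_get_contents with a stream context (cURL or Guzzle would follow the same shape). It assumes the stock Tika server, which by default listens on port 9998 and accepts a PUT to /tika with an Accept: text/plain header. The function names and the host in the usage note are hypothetical.

```php
<?php
// Sketch: send a file to a Tika server and read back plain text.
// Assumes the standard tika-server endpoint (PUT /tika, default port 9998).

/** Build the stream-context options for a plain-text extraction request. */
function tikaHttpContextOptions(string $fileBytes): array
{
    return [
        'http' => [
            'method'  => 'PUT',
            'header'  => [
                'Accept: text/plain',                     // ask for plain text back
                'Content-Type: application/octet-stream', // let Tika detect the real type
            ],
            'content' => $fileBytes,                      // raw file bytes as the request body
            'timeout' => 120,                             // seconds; large files take a while
        ],
    ];
}

/** Upload the file at $path to $endpoint and return the extracted text. */
function extractTextViaTikaServer(string $endpoint, string $path): string
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $context = stream_context_create(tikaHttpContextOptions($bytes));
    $text = file_get_contents($endpoint, false, $context);
    if ($text === false) {
        throw new RuntimeException("Tika request to $endpoint failed");
    }
    return $text;
}
```

Usage would look something like extractTextViaTikaServer('http://tika.internal:9998/tika', $_FILES['document']['tmp_name']), where both the host and the form field name are placeholders.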

Other approach (execute process on local server)

To run the parsing locally (Tika and PHP on the same server), I used Symfony/Process.

I'd, personally, discourage you from just using exec.
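
A rough sketch of the local approach with the Symfony Process component, assuming it was installed with composer require symfony/process and that Composer's autoloader is already loaded. The function names are hypothetical. The sketch passes --text (the Tika app's plain-text output option) rather than the question's -m, so swap in whichever flag produces the output you need.

```php
<?php
// Sketch: run the Tika command-line app locally through symfony/process.
// Assumes Composer's autoloader has been loaded before these functions are called.

use Symfony\Component\Process\Process;

/** Build the argument list; passing an array avoids shell quoting problems. */
function tikaCommand(string $jarPath, string $filePath): array
{
    return ['java', '-jar', $jarPath, '--text', $filePath];
}

/** Run Tika on $filePath and return whatever it wrote to stdout. */
function extractTextLocally(string $jarPath, string $filePath, float $timeoutSeconds = 60.0): string
{
    $process = new Process(tikaCommand($jarPath, $filePath));
    $process->setTimeout($timeoutSeconds); // give up if the JVM hangs on a bad file
    $process->mustRun();                   // throws ProcessFailedException on a non-zero exit code
    return $process->getOutput();          // the extracted text is written to stdout
}
```

Compared with a bare exec call, this gives you a timeout, an exception on failure, and separate access to stderr through getErrorOutput().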


I would add that having Tika on another server forces you to send that server the whole file payload uploaded by the user. A faster solution is to receive the upload in PHP and call the Tika process directly from the same script (or at least from the same machine). Otherwise you need a script that:

  • receives the uploaded data
  • uploads that to the Tika server (maybe as payload of an API call)
  • tells Tika (through the API) on the remote server to parse the file
  • downloads the parsed response data
  • works with it or displays it.

As I highlighted, the communication between the two servers adds a lot of overhead, and that is not desirable when the file to parse might be a 35MB PDF, is it? The user would have to wait, let's say, 2 minutes for the upload, plus another 20 seconds or so to send the file to the Tika server, and then another 3 seconds or so to get the parsed text back.

I strongly suggest staying on the same PHP server.

Kamafeather answered Oct 13 '22 14:10


If it is on your own managed servers, and both the PHP and Tika locations are known to you, just use exec. Or, if you prefer better control (which I suspect you do not need), use shell_exec.
If you have performance issues and/or need to scale this, then there is room for a more elaborate solution.
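
For illustration, a minimal sketch of the exec route. The function names are hypothetical, and --text is used here as the Tika app's plain-text option (replace it with the flag from your own working command if it differs). Because the file path comes from a user upload, each argument goes through escapeshellarg.

```php
<?php
// Sketch: call the Tika command-line app with plain exec().

/** Build the shell command, quoting both paths. */
function tikaShellCommand(string $jarPath, string $filePath): string
{
    // escapeshellarg() quotes each path so that spaces or shell
    // metacharacters in a file name cannot change the command.
    return 'java -jar ' . escapeshellarg($jarPath) . ' --text ' . escapeshellarg($filePath);
}

/** Run Tika and return its output, or throw if it exits with an error. */
function extractTextWithExec(string $jarPath, string $filePath): string
{
    $lines = [];
    $exitCode = 0;
    exec(tikaShellCommand($jarPath, $filePath), $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("Tika exited with code $exitCode");
    }
    return implode("\n", $lines); // exec() collects the output line by line
}
```

shell_exec would return the whole output as one string instead of an array of lines, but it does not report the exit code.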

Itay Moav -Malimovka answered Oct 13 '22 14:10