 

Extract the text from URLs using TIKA

Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other media documents?

arsenal Avatar asked Jul 11 '11 21:07



2 Answers

Yes, you can. Check the documentation.

Example

java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika

will show you the text on this page.
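If you would rather run that same command from Java code than from a shell, one option is to start tika-app as a child process. The following is only a minimal sketch, not part of Tika itself: the class name `TikaCli` and the methods `buildCommand` and `extractText` are hypothetical, and it assumes that `java` is on the PATH, that you supply the path to the tika-app jar yourself, that the code runs on Java 9 or later, and that tika-app writes its output in the platform default encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper that wraps the tika-app command line shown above.
final class TikaCli {

    // Builds the same command as the shell example: java -jar <jar> -t <url>
    static List<String> buildCommand(String jarPath, String url) {
        return List.of("java", "-jar", jarPath, "-t", url);
    }

    // Runs tika-app as a child process and returns whatever it prints to stdout.
    static String extractText(String jarPath, String url)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(jarPath, url))
                // Let Tika's error messages appear on our own stderr.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text;
        try (InputStream stdout = process.getInputStream()) {
            // Assumption: the output uses the platform default encoding.
            text = new String(stdout.readAllBytes(), Charset.defaultCharset());
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TikaCli <path-to-tika-app-jar> <url>
        System.out.println(extractText(args[0], args[1]));
    }
}
```

Keeping the command construction in its own method makes it easy to check the arguments without starting a process or touching the network.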

fvu Avatar answered Sep 24 '22 01:09



This is from lucid:

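import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

try (InputStream input = new FileInputStream(new File(resourceLocation))) { // stream is closed automatically
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    parser.parse(input, textHandler, metadata, new ParseContext());
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
}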

Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:

Parser parser = new AutoDetectParser();
surajz Avatar answered Sep 21 '22 01:09
