Extract the text from URLs using TIKA

Question

Is it possible to extract text from URLs with Tika? Any links would be appreciated. Or is Tika usable only for PDF, Word, and other media documents?

fvu · Accepted Answer

Check the documentation - yes you can.

Example

java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika

will show you the text of this page; the -t option selects plain-text output.
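The same thing can be done from Java code. Below is a minimal sketch using the Tika facade class, assuming a reasonably recent tika-core and tika-parsers on the classpath; the class name UrlTextExtractor and the helper extractText are made up for illustration, and the URL is taken from the first command-line argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class UrlTextExtractor {
    // Hypothetical helper: lets Tika detect the content type of the stream
    // and returns the extracted plain text.
    static String extractText(InputStream stream) throws IOException, TikaException {
        return new Tika().parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be an http(s) URL, e.g. the question page above
        try (InputStream stream = URI.create(args[0]).toURL().openStream()) {
            System.out.println(extractText(stream));
        }
    }
}
```

Note that parseToString truncates its result at a default maximum string length, so for very long pages you may want to raise it with setMaxStringLength on the Tika instance.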

surajz · Answer

This is from lucid:

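// Needs: java.io.*, org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
// org.apache.tika.parser.ParseContext, org.apache.tika.parser.pdf.PDFParser,
// org.apache.tika.sax.BodyContentHandler
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
// try-with-resources closes the stream even if parsing fails
try (InputStream input = new FileInputStream(new File(resourceLocation))) {
    parser.parse(input, textHandler, metadata, new ParseContext());
}
// "out" is assumed to be a PrintStream or PrintWriter, e.g. System.out
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());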

Instead of creating a PDFParser, you can use Tika's AutoDetectParser to automatically process different types of files:

Parser parser = new AutoDetectParser();
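To tie this back to the question, here is a rough sketch that feeds a URL's stream to AutoDetectParser instead of a local file. It assumes a Tika version that provides ParseContext; the class name AutoDetectUrlExample and the helper parse are invented for this example, and the URL comes from the first command-line argument.

```java
import java.io.InputStream;
import java.net.URI;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class AutoDetectUrlExample {
    // Hypothetical helper: detects the document type, parses it,
    // fills in the metadata and returns the body text.
    static String parse(InputStream input, Metadata metadata) throws Exception {
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit of BodyContentHandler
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(input, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        // args[0] is assumed to be an http(s) URL
        try (InputStream input = URI.create(args[0]).toURL().openStream()) {
            String text = parse(input, metadata);
            System.out.println("Content-Type: " + metadata.get("Content-Type"));
            System.out.println("content: " + text);
        }
    }
}
```

Since the parser works on any InputStream, the same helper handles HTML pages, PDFs, or Word documents served from a URL without changes.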

Tags: java, apache-tika, arsenal