Is it possible to extract text from URLs with Tika? Any links will be appreciated. Or TIKA is usable only for pdf, word and any other media documents?
Check the documentation - yes you can.
Example
java -jar tika-app-0.9.jar -t http://stackoverflow.com/questions/6656849/extract-the-text-from-url-using-tika
will show you the text on this page.
This is from lucid:
InputStream input = new FileInputStream(new File(resourceLocation));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata);
input.close();
out.println("Title: " + metadata.get("title"));
out.println("Author: " + metadata.get("Author"));
out.println("content: " + textHandler.toString());
Instead of creating a PDFParser
you can use Tika's AutoDetectParser
to automatically process diff types of files:
Parser parser = new AutoDetectParser();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With