I just want to know that how i can extract main text and plain text from html using Tika?
maybe one possible solution is to use BoilerPipeContentHandler but do you have some sample/demo codes to show it?
thanks very much in advance
The BodyContentHandler class doesn't use the Boilerpipe code, so you'll have to explicitly use the BoilerPipeContentHandler. The following code worked for me:
public String[] tika_autoParser() {
String[] result = new String[3];
try {
InputStream input = new FileInputStream(new File("test.html"));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(input, new BoilerpipeContentHandler(textHandler), metadata, context);
result[0] = "Title: " + metadata.get(metadata.TITLE);
result[1] = "Body: " + textHandler.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
return result;
}
Here is a sample:
public String[] tika_autoParser() {
String[] result = new String[3];
try {
InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"));
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(input, textHandler, metadata, context);
result[0] = "Title: " + metadata.get(metadata.TITLE);
result[1] = "Body: " + textHandler.toString();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (TikaException e) {
e.printStackTrace();
}
return result;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With