Tika - retrieve main content from docs

Question

The Apache Tika GUI utility provides an option for getting the main content (in addition to formatted text and structured text) of a given document or URL. I would like to know which method is responsible for extracting the main content of the document or URL, so that I can incorporate it into my program. I would also like to know whether any heuristic algorithm is used when extracting data from HTML pages, because the extracted content sometimes does not include the advertisements.

UPDATE: I found out that BoilerpipeContentHandler is responsible for it.

Jukka Zitting · Accepted Answer

The "main content" feature in the Tika GUI is implemented using the BoilerpipeContentHandler class, which relies on the boilerpipe library for the heavy lifting.
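To use the same handler from your own program, something along these lines should work. This is a minimal sketch, not the GUI's actual code: it assumes a Tika 1.x classpath (tika-parsers plus the boilerpipe jar), where the handler lives in org.apache.tika.parser.html; newer releases may have moved it to a different package. The class name MainContentExample and the method extractMainContent are made up for illustration.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
// Assumption: Tika 1.x package name; adjust the import if your version differs.
import org.apache.tika.parser.html.BoilerpipeContentHandler;

// Hypothetical helper class, named here only for illustration.
public class MainContentExample {

    // Parses the stream and returns whatever text boilerpipe keeps as main content.
    public static String extractMainContent(InputStream input) throws Exception {
        StringWriter writer = new StringWriter();
        // The handler buffers the parsed text blocks and writes the ones
        // boilerpipe classifies as content to the writer at end of document.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(writer);
        new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: MainContentExample <path-to-html-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractMainContent(in));
        }
    }
}
```

As for the heuristics: boilerpipe classifies blocks of text using shallow features such as word count and link density, which is likely why advertisements and navigation usually do not show up in the output. There is also a BoilerpipeContentHandler constructor that takes a ContentHandler and a BoilerpipeExtractor, if you want to try a different boilerpipe extractor than the default.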

Tags:

java

apache-tika

CrazyCoder