 

Tika - retrieve main content from docs

The Apache Tika GUI utility provides an option for getting the main content (in addition to formatted text and structured text) of a given document or URL. I want to know which method is responsible for extracting the main content of the document or URL, so that I can incorporate it into my program. I would also like to know whether it uses any heuristic algorithm when extracting data from HTML pages, because sometimes the advertisements do not appear in the extracted content.

UPDATE: I found out that BoilerpipeContentHandler is responsible for it.

CrazyCoder asked Feb 21 '23 14:02



1 Answer

The "main content" feature in the Tika GUI is implemented using the BoilerpipeContentHandler class that relies on the boilerpipe library for the heavy lifting.
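On the heuristic part of the question: boilerpipe is built around shallow text features of each text block, such as how many words it contains and what fraction of those words sit inside links. That is why link-heavy blocks like navigation bars and advertisements tend to drop out of the result. The sketch below is only a toy illustration of that idea, not boilerpipe's or Tika's actual code. The class `MainContentSketch`, the `TextBlock` type, and both thresholds are hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class MainContentSketch {

    /** A text block plus the number of its words that appear inside links (hypothetical type). */
    static final class TextBlock {
        final String text;
        final int linkedWords;

        TextBlock(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        /** Number of whitespace-separated words in the block. */
        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        /** Fraction of the block's words that are link text (0.0 for an empty block). */
        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Assumed thresholds for illustration only; real classifiers tune these differently.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Treats a block as main content if it is long enough and not dominated by links. */
    static boolean isMainContent(TextBlock block) {
        return block.wordCount() >= MIN_WORDS && block.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Returns the text of every block that passes the heuristic, in document order. */
    static List<String> extractMainContent(List<TextBlock> blocks) {
        List<String> kept = new ArrayList<>();
        for (TextBlock block : blocks) {
            if (isMainContent(block)) {
                kept.add(block.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TextBlock> page = new ArrayList<>();
        // Short navigation bar: every word is a link.
        page.add(new TextBlock("Home Products About Contact Login", 5));
        // Article paragraph: long, with a single linked word.
        page.add(new TextBlock("Apache Tika detects and extracts metadata and text from many "
                + "different file types using existing parser libraries.", 1));
        // Advertisement: long enough, but the whole block is one link.
        page.add(new TextBlock("Buy cheap widgets now and save big on every single order today", 12));

        for (String text : extractMainContent(page)) {
            System.out.println(text);
        }
    }
}
```

With the sample page in `main`, only the article paragraph passes both checks: the navigation bar is too short, and the advertisement consists entirely of link text.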

Jukka Zitting answered Mar 08 '23 11:03
