 

Apache Tika: exclude some HTML tags

I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except one thing: the contents of <noscript> tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none"> is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?

Bociek asked Feb 22 '19 15:02



1 Answer

I don't have an immediate solution, but the request seems reasonable so please open an issue on our JIRA for the team to discuss: https://issues.apache.org/jira/projects/TIKA/summary
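
In the meantime, one possible client-side workaround is to strip the unwanted elements from the HTML before sending it to Tika. The following is only a rough sketch using Python's standard library, not a Tika feature. The names `HiddenContentStripper`, `strip_hidden` and `extract_text` are made up for illustration, and the URL assumes a Tika server running locally on its default port 9998. It drops <noscript> elements and any element whose inline style contains display:none. It does not handle hiding via CSS classes or stylesheets, and it may miscount nesting on HTML with unclosed tags.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenContentStripper(HTMLParser):
    """Re-emits HTML while dropping blocked tags and inline-hidden elements."""

    def __init__(self, blocked_tags=("noscript",)):
        super().__init__(convert_charrefs=False)
        self.blocked_tags = set(blocked_tags)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _is_hidden(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag in self.blocked_tags or bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.out.append(self.get_starttag_text())
            return
        if self.skip_depth or self._is_hidden(tag, attrs):
            self.skip_depth += 1  # track nesting inside the dropped subtree
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html, blocked_tags=("noscript",)):
    """Return the HTML without blocked or inline-hidden elements.

    Comments and the doctype are dropped too, which is harmless for text extraction.
    """
    parser = HiddenContentStripper(blocked_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_text(html, tika_url="http://localhost:9998/tika"):
    """PUT the cleaned HTML to a Tika server (URL is an assumption) and return plain text."""
    request = urllib.request.Request(
        tika_url,
        data=strip_hidden(html).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8", "Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(open("page.html", encoding="utf-8").read())` would then send only the visible markup to the server. Other tags can be blacklisted by passing them in `blocked_tags`, for example `strip_hidden(html, blocked_tags=("noscript", "nav"))`.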

Tim Allison answered Sep 20 '22 08:09
