I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except one thing: the contents of <noscript> tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none"> is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?
I don't have an immediate solution, but the request seems reasonable, so please open an issue on our JIRA for the team to discuss: https://issues.apache.org/jira/projects/TIKA/summary
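Until tag filtering is available on the server side, one possible workaround is to strip the unwanted elements on the client before the HTML is sent to Tika. The following is a minimal sketch using only the Python standard library, not a Tika feature. The names `HtmlFilter`, `strip_hidden`, and `extract_text` are hypothetical, and the sketch assumes reasonably well-nested markup, inline `style` attributes (hiding done through external CSS is not detected), and a Tika server at its default address `http://localhost:9998/tika`. Comments and the doctype are dropped as a side effect.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HtmlFilter(HTMLParser):
    """Re-emits HTML, skipping blacklisted tags and inline-hidden elements."""

    def __init__(self, blacklist=("noscript",)):
        super().__init__(convert_charrefs=False)
        self.blacklist = set(blacklist)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def _is_dropped(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag in self.blacklist or bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.out.append(self.get_starttag_text())
        elif self.skip_depth or self._is_dropped(tag, attrs):
            self.skip_depth += 1  # track nesting inside the dropped subtree
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._is_dropped(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html, blacklist=("noscript",)):
    """Return html without blacklisted tags or display:none elements."""
    parser = HtmlFilter(blacklist)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_text(html, tika_url="http://localhost:9998/tika"):
    """Send the filtered HTML to a running Tika server and return plain text."""
    request = urllib.request.Request(
        tika_url,
        data=strip_hidden(html).encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8",
                 "Accept": "text/plain"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(open("page.html", encoding="utf-8").read())` would then return the text Tika extracts from the filtered markup. Additional tags can be excluded by passing a different `blacklist` to `strip_hidden`, for example `("noscript", "nav")`.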