
Libs for HTML sanitizing

Tags: java, html, parsing

I'm looking for an HTML sanitizer that I can call through an API to sanitise strings I receive from my web app. Are there any useful, easy-to-use libraries available? Does anyone know of one or two?

I don't need anything big; it just has to find unclosed tags and close them.

onigunn asked Dec 22 '09 15:12



5 Answers

https://github.com/OWASP/java-html-sanitizer is now marked ready for production use.

A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third-parties in your web application while protecting against XSS.

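You can use prepackaged policies:

PolicyFactory policy = Sanitizers.FORMATTING.and(Sanitizers.LINKS);
String safeHtml = policy.sanitize(untrustedHtml);  // untrustedHtml: the input string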

or the tests show how you can configure your own easily:

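import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("a")
    .allowUrlProtocols("https")
    .allowAttributes("href").onElements("a")
    .requireRelNofollowOnLinks()
    .toFactory();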

or write custom policies to do things like changing h1s to divs with a certain class:

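import java.util.List;
import org.owasp.html.ElementPolicy;
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

PolicyFactory policy = new HtmlPolicyBuilder()
    .allowElements("h1", "p")
    .allowElements(
        new ElementPolicy() {
          @Override
          public String apply(String elementName, List<String> attrs) {
            // Add class="header-<name>" and rename the element to div.
            attrs.add("class");
            attrs.add("header-" + elementName);
            return "div";
          }
        }, "h1")
    .toFactory();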
Mike Samuel answered Oct 14 '22 08:10


JTidy may help you.
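
For the narrow requirement in the question (finding tags left open and closing them), the core idea is a stack of open element names. The following is a minimal sketch in plain Java under stated assumptions: `TagCloser` and `closeUnclosed` are hypothetical names, this is not how JTidy works internally, and it is not a sanitizer (it removes nothing, so it gives no XSS protection). It only appends missing closing tags at the end of the input and does not handle comments, script content, mis-nested tags, or `>` inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: appends closing tags for elements left open. Not a sanitizer. */
public final class TagCloser {
    // Void elements never take a closing tag, so they are never pushed on the stack.
    private static final Set<String> VOID = Set.of(
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr");

    // Group 1 is "/" for a closing tag, group 2 is the element name.
    // Assumption: no '<' or '>' appears inside attribute values.
    private static final Pattern TAG =
        Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>");

    private TagCloser() {}

    public static String closeUnclosed(String html) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (VOID.contains(name)) {
                continue;
            }
            if (m.group(1).isEmpty()) {
                // Opening tag: remember it (most recent at the head).
                open.push(name);
            } else {
                // Closing tag: forget the most recent matching open tag, if any.
                open.removeFirstOccurrence(name);
            }
        }
        // Whatever is still on the stack was never closed; close innermost first.
        StringBuilder out = new StringBuilder(html);
        for (String name : open) {
            out.append("</").append(name).append('>');
        }
        return out.toString();
    }
}
```

For example, `TagCloser.closeUnclosed("<p>Hello <b>world")` returns `<p>Hello <b>world</b></p>`. For real-world input, a parser-based tool such as JTidy is the more robust choice.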

Jerome answered Oct 14 '22 10:10


The HTML parser jsoup also supports policy-based sanitisation: http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer

eckes answered Oct 14 '22 10:10


Apart from JTidy, you can also take a look at:
NekoHTML
TagSoup
Getting text in HTML document

Samuh answered Oct 14 '22 08:10


http://roberto.open-lab.com/2009/11/05/a-java-html-sanitizer-also-against-xss/

Stewart answered Oct 14 '22 09:10