
cause of error - Jsoup.isValid

Tags: java, jsoup

I have the following code, which works, but I want to know whether Jsoup can pinpoint the exact cause of a validation error.

The following returns true (as expected):

private void validateProtocol() {
    String html = "<p><a href='https://example.com/'>Link</a></p>";

    Whitelist whiteList = Whitelist.basic();
    whiteList.addProtocols("a", "href", "tel");
    whiteList.removeProtocols("a", "href", "ftp");
    boolean safe = Jsoup.isValid(html, whiteList);
    System.out.println(safe);
}

When I change the string to the following, it returns false (as expected):

String html = "<p><a href='ftp://example.com/'>Link</a></p>";

Now, the following code contains two errors: one is an invalid protocol and the other is the onfocus attribute on the link.

private void validateProtocol() {
    String html = "<p><a href='ftp://example.com/' onfocus='invalidLink()'>Link</a></p>";

    Whitelist whiteList = Whitelist.basic();
    whiteList.addProtocols("a", "href", "tel", "device");
    whiteList.removeProtocols("a", "href", "ftp");
    boolean safe = Jsoup.isValid(html, whiteList);
    System.out.println(safe);
}

The result is false, but is there any way to figure out which part of the link is invalid? For example, a wrong protocol or a wrong method?

Abi asked Jun 25 '15 08:06



1 Answer

Jsoup.isValid only returns a boolean, so to see what was rejected you can create a custom whitelist with a reporting feature.

MyReportEnabledWhitelist.java

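import java.util.HashSet;
import java.util.Set;

import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class MyReportEnabledWhitelist extends Whitelist {

    // Remembers element/attribute pairs already reported, so each failure is printed once
    // even if the same pair is checked more than once.
    private final Set<String> alreadyCheckedAttributeSignatures = new HashSet<>();

    @Override
    protected boolean isSafeTag(String tag) {
        boolean isSafe = super.isSafeTag(tag);

        if (!isSafe) {
            say("Disallowed tag: " + tag);
        }

        return isSafe;
    }

    @Override
    protected boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
        boolean isSafe = super.isSafeAttribute(tagName, el, attr);

        String signature = el.hashCode() + "-" + attr.hashCode();
        // Set.add returns true only the first time a signature is seen
        if (alreadyCheckedAttributeSignatures.add(signature)) {
            if (!isSafe) {
                say("Wrong attribute: " + attr.getKey() + " (" + attr.html() + ") in " + el.outerHtml());
            }
        }

        return isSafe;
    }

    // Reporting hook (assumed here to print to standard output; swap in a logger or a list as needed)
    private void say(String message) {
        System.out.println(message);
    }
}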

SAMPLE CODE

String html = "<p><a href='ftp://example.com/' onfocus='invalidLink()'>Link</a></p><a href='ftp://example2.com/'>Link 2</a>";

// * Custom whitelist
Whitelist myReportEnabledWhitelist = new MyReportEnabledWhitelist()
    // ** Basic whitelist (from Jsoup)
    .addTags("a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em", "i", "li", "ol", "p", "pre", "q", "small", "span",
                "strike", "strong", "sub", "sup", "u", "ul") //

    .addAttributes("a", "href") //
    .addAttributes("blockquote", "cite") //
    .addAttributes("q", "cite") //

    .addProtocols("a", "href", "ftp", "http", "https", "mailto") //
    .addProtocols("blockquote", "cite", "http", "https") //
    .addProtocols("cite", "cite", "http", "https") //

    .addEnforcedAttribute("a", "rel", "nofollow") //

    // ** Customizations
    .addTags("body") //
    .addProtocols("a", "href", "tel", "device") //
    .removeProtocols("a", "href", "ftp");

boolean safeCustom = Jsoup.isValid(html, myReportEnabledWhitelist);
System.out.println(safeCustom);

OUTPUT

Wrong attribute: href (href="ftp://example.com/") in <a href="ftp://example.com/" onfocus="invalidLink()">Link</a>
Wrong attribute: onfocus (onfocus="invalidLink()") in <a href="ftp://example.com/" onfocus="invalidLink()">Link</a>
Wrong attribute: href (href="ftp://example2.com/") in <a href="ftp://example2.com/">Link 2</a>
false
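
The output above labels both failures "Wrong attribute", so it still does not say whether the attribute itself is disallowed or only its protocol. As a rough, standalone sketch of how that distinction could be made, the class below mimics the two checks with plain Java collections. It is not Jsoup code: AttributeDiagnosis, Reason, A_TAG_RULES and diagnose are hypothetical names, the rule table is hand-copied from the question's setup (basic whitelist plus tel and device, minus ftp), and it assumes every allowed attribute carries a protocol list and that values are absolute URLs.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Standalone illustration only; none of these names exist in Jsoup.
final class AttributeDiagnosis {

    enum Reason { OK, ATTRIBUTE_NOT_ALLOWED, PROTOCOL_NOT_ALLOWED }

    // Assumed rules for <a>: href is the only allowed attribute,
    // and ftp has been removed from its protocol list.
    static final Map<String, List<String>> A_TAG_RULES =
            Map.of("href", List.of("http", "https", "mailto", "tel", "device"));

    // Returns why (or whether) an attribute would be rejected under the given rules.
    static Reason diagnose(Map<String, List<String>> rules, String attrName, String attrValue) {
        List<String> protocols = rules.get(attrName.toLowerCase(Locale.ROOT));
        if (protocols == null) {
            return Reason.ATTRIBUTE_NOT_ALLOWED;      // e.g. onfocus
        }
        String value = attrValue.trim().toLowerCase(Locale.ROOT);
        for (String protocol : protocols) {
            // A value passes when it starts with "<protocol>:"
            if (value.startsWith(protocol + ":")) {
                return Reason.OK;
            }
        }
        return Reason.PROTOCOL_NOT_ALLOWED;           // e.g. ftp://example.com/
    }

    public static void main(String[] args) {
        System.out.println(diagnose(A_TAG_RULES, "href", "ftp://example.com/"));   // PROTOCOL_NOT_ALLOWED
        System.out.println(diagnose(A_TAG_RULES, "onfocus", "invalidLink()"));     // ATTRIBUTE_NOT_ALLOWED
        System.out.println(diagnose(A_TAG_RULES, "href", "https://example.com/")); // OK
    }
}
```

A check along these lines could be called from the isSafeAttribute override when super returns false, so the message reads "wrong protocol" or "attribute not allowed" instead of the generic text. The rule table would then have to be kept in sync with the whitelist configuration by hand.
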
Stephan answered Nov 05 '22 03:11