
Jsoup converts & to &amp; when I need the value left as it is

Tags:

java

jsoup

In a few cases I pass JSON containing the URL of the page on which the user performed some action. That page URL includes a query string, which I need in order to redirect the user back to the same page when my application requires it. My JSON looks like this:

{
"userId":"123456789",
"pageUrl":"http://exampl.com/designs.jsp?templateId=f348aaf2-45e4-4836-9be4-9a7e63105932&kind=123",
"action":"favourite"
}

But when I run this JSON through Jsoup.clean(json, Whitelist.basic()), I see that & has been replaced with &amp;. Can I configure Jsoup not to escape this one character?

Pokuri asked Jul 13 '15 08:07



2 Answers

The escaping happens in org.jsoup.nodes.Entities. This is the code in question:

static void escape(StringBuilder accum, String string,
        Document.OutputSettings out, boolean inAttribute,
        boolean normaliseWhite, boolean stripLeadingWhite) {
    boolean lastWasWhite = false;
    boolean reachedNonWhite = false;
    EscapeMode escapeMode = out.escapeMode();
    CharsetEncoder encoder = out.encoder();
    // The decompiler showed this call as the synthetic accessor access$300
    CoreCharset coreCharset = CoreCharset.byName(encoder.charset().name());
    Map map = escapeMode.getMap();
    int length = string.length();
    int codePoint;
    for (int offset = 0; offset < length; offset += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(offset);

        if (normaliseWhite) {
            if (StringUtil.isWhitespace(codePoint)) {
                if ((stripLeadingWhite) && (!(reachedNonWhite)))
                    continue;
                if (lastWasWhite)
                    continue;
                accum.append(' ');
                lastWasWhite = true;
                continue;
            }
            lastWasWhite = false;
            reachedNonWhite = true;
        }

        if (codePoint < 65536) {
            char c = (char) codePoint;

            switch (c) {
            case '&':
                accum.append("&amp;");
                break;
            case 0xA0: // non-breaking space (U+00A0), not a regular space
                if (escapeMode != EscapeMode.xhtml)
                    accum.append("&nbsp;");
                else
                    accum.append(c);
                break;
            case '<':
                if (!(inAttribute))
                    accum.append("&lt;");
                else
                    accum.append(c);
                break;
            case '>':
                if (!(inAttribute))
                    accum.append("&gt;");
                else
                    accum.append(c);
                break;
            case '"':
                if (inAttribute)
                    accum.append("&quot;");
                else
                    accum.append(c);
                break;
            default:
                if (canEncode(coreCharset, c, encoder))
                    accum.append(c);
                else if (map.containsKey(Character.valueOf(c)))
                    accum.append('&')
                            .append((String) map.get(Character.valueOf(c)))
                            .append(';');
                else
                    accum.append("&#x")
                            .append(Integer.toHexString(codePoint))
                            .append(';');
            }
        } else {
            String c = new String(Character.toChars(codePoint));
            if (encoder.canEncode(c))
                accum.append(c);
            else
                accum.append("&#x").append(Integer.toHexString(codePoint))
                        .append(';');
        }
    }
}
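
To make the relevant branch easier to see, here is a stripped-down, stdlib-only sketch of the switch above. It is not jsoup's code: the class name EscapeSketch and the method escapeText are hypothetical, and whitespace normalisation and the charset checks are left out. What it shows is that the '&' case has no condition attached, unlike '<', '>' and '"', so in this code path no output setting turns it off.

```java
public class EscapeSketch {
    // Hypothetical, simplified stand-in for the switch in Entities.escape:
    // no whitespace normalisation and no charset/encoder handling.
    public static String escapeText(String input, boolean inAttribute) {
        StringBuilder accum = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':
                    // Unconditional: escaped in text and in attributes alike
                    accum.append("&amp;");
                    break;
                case '<':
                    accum.append(inAttribute ? "<" : "&lt;");
                    break;
                case '>':
                    accum.append(inAttribute ? ">" : "&gt;");
                    break;
                case '"':
                    accum.append(inAttribute ? "&quot;" : "\"");
                    break;
                default:
                    accum.append(c);
            }
        }
        return accum.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("designs.jsp?templateId=abc&kind=123", false));
    }
}
```

Calling escapeText with inAttribute set to true changes how '<', '>' and '"' are handled, but the ampersand is escaped either way.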

A quick way to do what you need would be to unescape the result after cleaning, something like this:

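import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Whitelist;

String str = "http://exampl.com/designs.jsp?templateId=f348aaf2-45e4-4836-9be4-9a7e63105932&kind=123";
str = Jsoup.clean(str, Whitelist.basic());   // '&' comes back escaped as "&amp;"
System.out.println(str);
str = Parser.unescapeEntities(str, true);    // turns the entities back into characters
System.out.println(str);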

Another way would be to extend the above class and override the method that is causing the problem. However, since that method is visible only within its package (default visibility), you would have to download the source, change the visibility of the method, and then override it in your own class.

Alkis Kalogeris answered Oct 12 '22 10:10


As a workaround, after applying Jsoup.clean() I replace &amp; with & using a regex.

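// url holds the raw page URL; a new variable is used because a local cannot be read in its own declaration
String cleanedUrl = Jsoup.clean(url, Whitelist.basic()).replaceAll("&amp;", "&");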
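
Since "&amp;" contains no regex metacharacters, the same workaround can also be written with the literal String.replace. The sketch below is stdlib-only and takes an already-cleaned string as input, so the jsoup call itself is left out; the class name AmpersandRestore and the method restoreAmpersands are hypothetical. One caveat applies to either form: an "&amp;" that the cleaner produced from an already-escaped ampersand in the input is rewritten as well.

```java
public class AmpersandRestore {
    // Hypothetical helper: undo only the &amp; escaping applied by the cleaner,
    // leaving other entities such as &lt; and &gt; untouched.
    public static String restoreAmpersands(String cleaned) {
        return cleaned.replace("&amp;", "&"); // literal replacement, no regex involved
    }

    public static void main(String[] args) {
        // Stand-in for the output of Jsoup.clean(url, Whitelist.basic())
        String cleaned = "http://exampl.com/designs.jsp?templateId=f348aaf2-45e4-4836-9be4-9a7e63105932&amp;kind=123";
        System.out.println(restoreAmpersands(cleaned));
    }
}
```

Unlike unescaping every entity, this leaves escaped angle brackets as they are, so text that the cleaner escaped is not turned back into markup.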

Pokuri answered Oct 12 '22 11:10