
jsoup escaping ampersand in link href

Tags: jsoup

jsoup is escaping the ampersand in the query portion of a URL in a link href. Given the sample below:

    String l_input = "<html><body>before <a href=\"http://a.b.com/ct.html\">link text</a> after</body></html>";
    org.jsoup.nodes.Document l_doc = org.jsoup.Jsoup.parse(l_input);
    org.jsoup.select.Elements l_html_links = l_doc.getElementsByTag("a");
    for (org.jsoup.nodes.Element l : l_html_links) {
      l.attr("href", "http://a.b.com/ct.html?a=111&b=222");
    }
    String l_output = l_doc.outerHtml();

The output is

    <html>
    <head></head>
    <body>
    before 
    <a href="http://a.b.com/ct.html?a=111&amp;b=222">link text</a> after
    </body>
    </html>

The single & is being escaped to &amp;. Shouldn't it stay as &?

Mitch asked Aug 15 '13 18:08


2 Answers

It seems you can't do it. I went through the source and found the place where the escape happens.

It is defined in Attribute.java:

    /**
     Get the HTML representation of this attribute; e.g. {@code href="index.html"}.
     @return HTML
     */
    public String html() {
        return key + "=\"" + Entities.escape(value, (new Document("")).outputSettings()) + "\"";
    }

As you can see, it calls Entities.escape (from Entities.java) with the default output settings of a freshly created Document (new Document("")). That is why you can't override these settings.

Maybe you should post a feature request for that.

By the way, the default escape mode is base.

Document.java creates a default OutputSettings object, which is where the mode is defined:

    /**
     * A HTML Document.
     *
     * @author Jonathan Hedley
     */
    public class Document extends Element {
        private OutputSettings outputSettings = new OutputSettings();
        // ...
    }


    /**
     * A Document's output settings control the form of the text() and html() methods.
     */
    public static class OutputSettings implements Cloneable {
        private Entities.EscapeMode escapeMode = Entities.EscapeMode.base;
        // ...
    }

Workaround (unescape as XML):

With StringEscapeUtils from the Apache Commons Lang project you can unescape those entities easily:

    String unescapedXml = StringEscapeUtils.unescapeXml(l_output);
    System.out.println(unescapedXml);

This will print:

    <html>
     <head></head>
     <body>
      before 
      <a href="http://a.b.com/ct.html?a=111&b=222">link text</a> after
     </body>
    </html>

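But of course, it will replace every &amp; in the document, not only the one in the href. It also unescapes the other XML entities such as &lt; and &gt;, so the result may no longer be valid HTML.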
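If the raw & really is required in the final string, one narrower option is to unescape only inside href values rather than across the whole document. The following is a minimal sketch using only java.util.regex, not something jsoup provides. The class and method names (HrefAmpersandDemo, unescapeHrefAmpersands) are hypothetical, it assumes double-quoted href attributes like the ones in the output above, and the string it returns is technically no longer correctly escaped HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or Commons Lang.
final class HrefAmpersandDemo {
    // Group 1: href=" prefix, group 2: attribute value, group 3: closing quote.
    // Assumption: href values are always double-quoted.
    private static final Pattern HREF = Pattern.compile("(href=\")([^\"]*)(\")");

    // Replaces &amp; with & inside href values only; all other text is left untouched.
    static String unescapeHrefAmpersands(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(2).replace("&amp;", "&");
            // quoteReplacement keeps any $ or \ in the URL from being treated as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + value + m.group(3)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Tom &amp; Jerry <a href=\"http://a.b.com/ct.html?a=111&amp;b=222\">link text</a></p>";
        // The &amp; in the paragraph text stays escaped; only the href changes.
        System.out.println(unescapeHrefAmpersands(html));
    }
}
```

Matching HTML with a regular expression is fragile, so this is only reasonable as a last step on output you generated yourself.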

d0x answered Oct 03 '22 00:10


What jsoup does is actually the correct way to write URLs in HTML. For example, if you write "id=1&copy=true", a browser may interpret it as "id=1©=true", so you have to escape it.

I got this from https://groups.google.com/forum/#!topic/jsoup/eK4XxHc4Tro
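To make this concrete: the &amp; exists only in the serialized HTML. An HTML parser decodes it back to & when it reads the attribute, so the URL that is actually requested is unchanged. Below is a minimal, standard-library-only sketch of that round trip. The class and methods are hypothetical and handle just & and the double quote, which is far less than jsoup's Entities class or a real browser does.

```java
// Hypothetical illustration of the escape/unescape round trip for an attribute value.
final class AttributeRoundTrip {
    // Simplified stand-in for attribute escaping: only & and " are handled.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    // Reverse of escapeAttribute. &amp; is decoded last so that "&amp;quot;"
    // becomes the literal text "&quot;" rather than a quote character.
    static String unescapeAttribute(String escaped) {
        return escaped.replace("&quot;", "\"").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String url = "http://a.b.com/ct.html?a=111&b=222";
        String serialized = escapeAttribute(url);
        // What ends up in the HTML source:
        System.out.println("href=\"" + serialized + "\"");
        // What a parser hands back after decoding is the original URL again:
        System.out.println(unescapeAttribute(serialized).equals(url));
    }
}
```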

gguardin answered Oct 03 '22 01:10