jsoup is escaping the ampersand in the query portion of a URL in a link's href. Given the sample below:
String l_input = "<html><body>before <a href=\"http://a.b.com/ct.html\">link text</a> after</body></html>";
org.jsoup.nodes.Document l_doc = org.jsoup.Jsoup.parse(l_input);
org.jsoup.select.Elements l_html_links = l_doc.getElementsByTag("a");
for (org.jsoup.nodes.Element l : l_html_links) {
    l.attr("href", "http://a.b.com/ct.html?a=111&b=222");
}
String l_output = l_doc.outerHtml();
The output is
<html>
<head></head>
<body>
before
<a href="http://a.b.com/ct.html?a=111&amp;b=222">link text</a> after
</body>
</html>
The single & is being escaped to &amp;. Shouldn't it stay as &?
It seems you can't change this. I went through the source and found the place where the escaping happens.
It is defined in Attribute.java:
/**
 Get the HTML representation of this attribute; e.g. {@code href="index.html"}.
 @return HTML
 */
public String html() {
    return key + "=\"" + Entities.escape(value, (new Document("")).outputSettings()) + "\"";
}
There you can see that it calls Entities.escape with the default OutputSettings of a new Document("").
That's why you can't override these settings.
Maybe you should post a feature request for that.
Btw: the default escape mode is set to base.
Document.java creates a default OutputSettings object, and that is where it is defined. See:
/**
* A HTML Document.
*
* @author Jonathan Hedley
*/
public class Document extends Element {
    private OutputSettings outputSettings = new OutputSettings();
    // ...
}
/**
* A Document's output settings control the form of the text() and html() methods.
*/
public static class OutputSettings implements Cloneable {
    private Entities.EscapeMode escapeMode = Entities.EscapeMode.base;
    // ...
}
Workaround (unescape as XML):
With StringEscapeUtils from the Apache Commons Lang project you can unescape those things easily. See:
String unescapedXml = StringEscapeUtils.unescapeXml(l_output);
System.out.println(unescapedXml);
This will print:
<html>
<head></head>
<body>
before
<a href="http://a.b.com/ct.html?a=111&b=222">link text</a> after
</body>
</html>
But of course, it will replace every &amp; in the output, not only the one inside the href.
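If only the href values should be touched, a narrower post-processing step is possible. The following is a minimal sketch using only java.util.regex, not anything jsoup provides: the class and method names are made up for illustration, and it assumes attribute values are double-quoted, as in the output shown above. The pattern would also match attributes whose names merely end in href.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefAmpersandUnescaper {
    // Hypothetical helper: matches href="..." with a double-quoted value.
    private static final Pattern HREF = Pattern.compile("(href=\")([^\"]*)(\")");

    /** Replaces &amp; with & inside double-quoted href values only. */
    public static String unescapeHrefAmpersands(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Only the attribute value (group 2) is unescaped.
            String value = m.group(2).replace("&amp;", "&");
            // quoteReplacement keeps any $ or \ in the URL from being treated as a group reference.
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + value + m.group(3)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Tom &amp; Jerry <a href=\"http://a.b.com/ct.html?a=111&amp;b=222\">link</a></p>";
        // The &amp; in the text stays escaped; only the one in the href is changed.
        System.out.println(unescapeHrefAmpersands(html));
    }
}
```

Keep in mind that the result is no longer strictly valid HTML, so this only makes sense if whatever consumes the output really needs the raw ampersand.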
What jsoup does is actually the right way to write URLs in HTML. E.g. if you write "id=1&copy=true", a browser may interpret it as "id=1©=true", because &copy is read as the entity for ©. So you have to escape it.
I got this from https://groups.google.com/forum/#!topic/jsoup/eK4XxHc4Tro
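To make the point concrete, here is a small sketch of what escaping an attribute value amounts to. It is not jsoup's implementation (jsoup uses Entities.escape, as quoted earlier); the class and method names are invented for illustration and only the two characters relevant to a double-quoted attribute are handled.

```java
public class AttributeEscapeSketch {
    /** Escapes the characters that are unsafe in a double-quoted HTML attribute value. */
    public static String escapeAttribute(String value) {
        // Replace & first so the ampersand introduced by &quot; is not escaped a second time.
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // Without escaping, "&copy" in this URL could be read as the copyright entity.
        System.out.println("href=\"" + escapeAttribute("page?id=1&copy=true") + "\"");
    }
}
```

A browser decodes &amp; back to & when it reads the attribute, so the link still points to page?id=1&copy=true.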