 

Jsoup connect doesn't work correctly when link has Turkish letters

I'm using Jsoup to fetch HTML from websites with the following code:

String url = "http://www.example.com";
Document doc = Jsoup.connect(url).get();

This works, but when the link contains Turkish letters, like this:

String url="http://www.example.com/?q=Türkçe";
Document doc=Jsoup.connect(url).get();

Jsoup sends the request as "http://www.example.com/?q=Trke", dropping the Turkish letters, so I don't get the correct result. How can I solve this problem?

Erdinç Özdemir asked Jan 15 '14 08:01


3 Answers

A working solution: if the target encoding is UTF-8, pass the value as a query parameter and let Jsoup encode it:

Document document = Jsoup.connect("http://www.example.com")
        .data("q", "Türkçe")
        .get();

The resulting request URL is:

URL=http://www.example.com?q=T%C3%BCrk%C3%A7e

For a custom encoding, encode the query value yourself and append it to the URL:

import java.net.URLEncoder;

// Percent-encode only the query value, using the charset the server expects.
String query = URLEncoder.encode("Türkçe", "ISO-8859-3");

// Append the encoded value directly; passing it to .data() would
// percent-encode the '%' signs a second time.
Document doc = Jsoup.connect("http://www.example.com/?q=" + query)
        .get();
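
To see what the two encodings in this answer actually put on the wire, here is a minimal sketch that uses only the JDK (no Jsoup). The class name QueryEncodingDemo and the helper encodeValue are made up for illustration, and it assumes the JRE ships the ISO-8859-3 charset. "Türkçe" is written with Unicode escapes so the source compiles under any file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class QueryEncodingDemo {

    // Hypothetical helper: percent-encodes one query value in the given charset.
    static String encodeValue(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String value = "T\u00FCrk\u00E7e"; // "Türkçe"

        // UTF-8 uses two bytes each for ü and ç.
        System.out.println(encodeValue(value, "UTF-8"));      // T%C3%BCrk%C3%A7e

        // ISO-8859-3 uses one byte each for ü and ç.
        System.out.println(encodeValue(value, "ISO-8859-3")); // T%FCrk%E7e
    }
}
```

Which form is correct depends on the charset the server uses to decode the query string.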
MariuszS answered Sep 28 '22 00:09


Non-ASCII characters are not allowed in URLs as per the specification. We're used to seeing them because browsers display them in address bars, but they are not sent to servers in that form.

You have to URL-encode your query before passing it to Jsoup. Jsoup.connect("http://www.example.com").data("q", "Türkçe"), as proposed by MariuszS, does just that.
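
As a rough illustration of the difference between the form a browser displays and the form that is sent, the JDK's java.net.URI can produce the ASCII-only version of a URL. This is only a sketch: the class name AsciiUrlDemo and the helper toAsciiUrl are invented for the example, and toASCIIString always uses UTF-8 for the percent-escapes.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class AsciiUrlDemo {

    // Hypothetical helper: builds a URI from its parts and returns the
    // ASCII-only form, with non-ASCII characters percent-encoded as UTF-8.
    static String toAsciiUrl(String scheme, String host, String path, String query) {
        try {
            return new URI(scheme, host, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid URL parts", e);
        }
    }

    public static void main(String[] args) {
        // "q=Türkçe", written with Unicode escapes so the file encoding does not matter.
        String query = "q=T\u00FCrk\u00E7e";

        // Prints http://www.example.com/?q=T%C3%BCrk%C3%A7e
        System.out.println(toAsciiUrl("http", "www.example.com", "/", query));
    }
}
```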

Grooveek answered Sep 27 '22 22:09


I found this list of HTML codes for Turkish characters on Google: http://turkishbasics.com/resources/turkish-characters-html-codes.php. Maybe you can use them in the URL like this:

 String url="http://www.example.com/?q=Türk&#231e";
 Document doc=Jsoup.connect(url).get();
Fraggles answered Sep 28 '22 00:09