
How to properly encode this URL

Tags: java, jsoup

I am trying to get this URL using JSoup

http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG

Even with encoding, I get an exception. I don't understand why the encoding is wrong. It returns

http://betatruebaonline.com/img/parte/330/CIGUE%C3%91AL.JPG

instead of the correct

http://betatruebaonline.com/img/parte/330/CIGUEN%CC%83AL.JPG

How can I fix this? Thanks.

import java.net.URLEncoder;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

private static void GetUrl()
{
    try
    {
        String url = "http://betatruebaonline.com/img/parte/330/";
        String encoded = URLEncoder.encode("CIGUEÑAL.JPG","UTF-8");
        Response img = Jsoup
                            .connect(url + encoded)
                            .ignoreContentType(true)
                            .execute();

        // Print the URL that was actually requested, not just the base path
        System.out.println(url + encoded);
        System.out.println("PASSED");
    }
    catch(Exception e)
    {
        System.out.println("Error getting url");
        System.out.println(e.getMessage());
    }
}
ppk asked Apr 11 '18 07:04



3 Answers

The encoding is not wrong. The problem is that the character "Ñ" can be represented in Unicode in two ways, precomposed and decomposed (composite). The two look the same on screen but are different character sequences, so they are percent-encoded differently:

precomposed (NFC): Ñ (U+00D1)                    -> %C3%91
decomposed (NFD):  N + combining tilde (U+0303)  -> N%CC%83

I emphasize that BOTH ARE CORRECT; which one you need depends on the Unicode form you want:

String normalize = Normalizer.normalize("Ñ", Normalizer.Form.NFD);
System.out.println(URLEncoder.encode("Ñ", "UTF-8")); //%C3%91
System.out.println(URLEncoder.encode(normalize, "UTF-8")); //N%CC%83
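
To see the difference at the code point and byte level, here is a minimal sketch. The class and helper names (EnyeForms, codePoints, percentBytes) are made up for illustration, and percentBytes escapes every byte, including the plain N, purely so the bytes can be inspected.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EnyeForms {

    // Lists the code points of a string, e.g. "U+004E U+0303".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Writes every UTF-8 byte of the string as %XX (inspection only, not a URL encoder).
    static String percentBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nfc = "\u00D1"; // precomposed capital N with tilde
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(codePoints(nfc) + " -> " + percentBytes(nfc)); // U+00D1 -> %C3%91
        System.out.println(codePoints(nfd) + " -> " + percentBytes(nfd)); // U+004E U+0303 -> %4E%CC%83
        System.out.println(nfc.equals(nfd));                              // false
    }
}
```

The last line is the important one: the two strings render identically but do not compare equal, which is why a server that stores file names in one form will not find a path sent in the other.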
yelliver answered Oct 23 '22 11:10



What happens here?

As stated by @yelliver, the web server seems to use NFD-normalized Unicode in its path names, so the solution is to send the path in the same form.

Is the web server doing the right thing?

1. For those who are curious (like me), this article on Multilingual Web Addresses sheds some light on the subject. In the section on IRI paths (the part that is actually handled by the web server), it states:

Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.

2. More on how to encode paths can be found in Section 5.3.2.2 of RFC 3987, the IETF Proposed Standard on Internationalized Resource Identifiers (IRIs). It says:

Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

3. The Unicode Consortium states:

NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.
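
The remark about NFKC and full-width or half-width characters can be illustrated with a small sketch. It relies only on java.text.Normalizer; the class name CompatibilityForms and the two helpers are hypothetical.

```java
import java.text.Normalizer;

public class CompatibilityForms {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String fullWidthA = "\uFF21";  // FULLWIDTH LATIN CAPITAL LETTER A
        String halfWidthKa = "\uFF76"; // HALFWIDTH KATAKANA LETTER KA

        // NFC only composes canonical equivalents, so the full-width letter survives.
        System.out.println(nfc(fullWidthA).equals(fullWidthA)); // true
        // NFKC also folds compatibility variants into their ordinary counterparts.
        System.out.println(nfkc(fullWidthA).equals("A"));       // true
        System.out.println(nfkc(halfWidthKa).equals("\u30AB")); // true
    }
}
```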

Conclusion

The web server mentioned in the question does not conform to the recommendations of the IRI standard or the Unicode Consortium: it uses NFD instead of NFC or NFKC. One way to encode a URL string in the recommended form is as follows:

// url is a java.net.URL built from the original address
URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());

Then convert that URI to an ASCII string:

String correctEncodedURL=uri.toASCIIString(); 

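toASCIIString() calls an internal encode() step that normalizes the string to NFC before percent-encoding non-ASCII characters, and IDN.toASCII() converts the host name to Punycode. Note that for the URL in the question this yields the NFC form (%C3%91), which follows the recommendations above but is not the NFD form that this particular server expects.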
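
Put together as a runnable sketch, the two steps look roughly like this. The class name NfcUrlEncoding and the helper toAsciiUrl are hypothetical; the URL is the one from the question, written with a Unicode escape so the source file's encoding does not matter.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.text.Normalizer;

public class NfcUrlEncoding {

    // Splits the address with java.net.URL, rebuilds it as a URI, and returns the ASCII form.
    static String toAsciiUrl(String raw) {
        try {
            URL url = new URL(raw);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                    IDN.toASCII(url.getHost()), url.getPort(),
                    url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a usable URL: " + raw, e);
        }
    }

    public static void main(String[] args) {
        String nfcInput = "http://betatruebaonline.com/img/parte/330/CIGUE\u00D1AL.JPG";
        String nfdInput = Normalizer.normalize(nfcInput, Normalizer.Form.NFD);

        // Both calls print the same line, because toASCIIString() renormalizes to NFC.
        System.out.println(toAsciiUrl(nfcInput));
        System.out.println(toAsciiUrl(nfdInput));
    }
}
```

The second call is there to show the design consequence: decomposing the input first does not help with this approach, since the NFC step undoes it.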

jschnasse answered Oct 23 '22 09:10



Actually you have to convert the URL to the decomposed form before URL encoding.

Here is a solution which works using Guava and java.text.Normalizer:

import com.google.common.escape.Escaper;
import com.google.common.net.UrlEscapers;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.text.Normalizer;

public class JsoupImageDownload {

    public static void main(String[] args) {

        String url = "http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG";
        String encodedurl = null;
        try {
            encodedurl = Normalizer.normalize(url, Normalizer.Form.NFD);
            Escaper escaper = UrlEscapers.urlFragmentEscaper();
            encodedurl = escaper.escape(encodedurl);
            Connection.Response img = Jsoup
                    .connect(encodedurl)
                    .ignoreContentType(true)
                    .execute();

            System.out.println(url);
            System.out.println("PASSED");
        } catch (Exception e) {
            System.out.println("Error getting url: " + encodedurl);
            System.out.println(e.getMessage());
        }
    }
}

These are the Maven dependencies:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>24.1-jre</version>
</dependency>
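
If pulling in Guava is not an option, the same idea can be sketched with the JDK alone by decomposing only the file name and encoding it with URLEncoder. This is an alternative under stated assumptions, not part of the solution above: the class name NfdPathSegment and the helper encodeSegmentNfd are hypothetical, and the helper is meant for a single path segment, not a whole URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.Normalizer;

public class NfdPathSegment {

    // Decomposes one path segment to NFD and percent-encodes it as UTF-8.
    static String encodeSegmentNfd(String segment) {
        String decomposed = Normalizer.normalize(segment, Normalizer.Form.NFD);
        try {
            // URLEncoder targets form data and writes a space as "+";
            // inside a path the space has to be "%20" instead.
            return URLEncoder.encode(decomposed, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base = "http://betatruebaonline.com/img/parte/330/";
        // Prints http://betatruebaonline.com/img/parte/330/CIGUEN%CC%83AL.JPG
        System.out.println(base + encodeSegmentNfd("CIGUE\u00D1AL.JPG"));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of encodedurl.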
gil.fernandes answered Oct 23 '22 10:10
