
Java equivalent to JavaScript's encodeURIComponent that produces identical output?

I've been experimenting with various bits of Java code trying to come up with something that will encode a string containing quotes, spaces and "exotic" Unicode characters and produce output that's identical to JavaScript's encodeURIComponent function.

My torture test string is: "A" B ± "

If I enter the following JavaScript statement in Firebug:

encodeURIComponent('"A" B ± "'); 

—Then I get:

"%22A%22%20B%20%C2%B1%20%22" 

Here's my little test Java program:

    import java.io.UnsupportedEncodingException;
    import java.net.URLEncoder;

    public class EncodingTest
    {
      public static void main(String[] args) throws UnsupportedEncodingException
      {
        String s = "\"A\" B ± \"";
        System.out.println("URLEncoder.encode returns "
          + URLEncoder.encode(s, "UTF-8"));

        System.out.println("getBytes returns "
          + new String(s.getBytes("UTF-8"), "ISO-8859-1"));
      }
    }

—This program outputs:

    URLEncoder.encode returns %22A%22+B+%C2%B1+%22
    getBytes returns "A" B Â± "

Close, but no cigar! What is the best way of encoding a UTF-8 string using Java so that it produces the same output as JavaScript's encodeURIComponent?

EDIT: I'm using Java 1.4 moving to Java 5 shortly.

John Topley asked Mar 03 '09 16:03




2 Answers

This is the class I came up with in the end:

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.net.URLEncoder;

    /**
     * Utility class for JavaScript compatible UTF-8 encoding and decoding.
     *
     * @see http://stackoverflow.com/questions/607176/java-equivalent-to-javascripts-encodeuricomponent-that-produces-identical-output
     * @author John Topley
     */
    public class EncodingUtil
    {
      /**
       * Decodes the passed UTF-8 String using an algorithm that's compatible with
       * JavaScript's <code>decodeURIComponent</code> function. Returns
       * <code>null</code> if the String is <code>null</code>.
       *
       * @param s The UTF-8 encoded String to be decoded
       * @return the decoded String
       */
      public static String decodeURIComponent(String s)
      {
        if (s == null)
        {
          return null;
        }

        String result = null;

        try
        {
          // URLDecoder turns "+" into a space, whereas JavaScript's
          // decodeURIComponent leaves it alone, so protect literal plus
          // signs before decoding.
          result = URLDecoder.decode(s.replaceAll("\\+", "%2B"), "UTF-8");
        }

        // This exception should never occur.
        catch (UnsupportedEncodingException e)
        {
          result = s;
        }

        return result;
      }

      /**
       * Encodes the passed String as UTF-8 using an algorithm that's compatible
       * with JavaScript's <code>encodeURIComponent</code> function. Returns
       * <code>null</code> if the String is <code>null</code>.
       *
       * @param s The String to be encoded
       * @return the encoded String
       */
      public static String encodeURIComponent(String s)
      {
        // Honour the documented contract: URLEncoder.encode would throw a
        // NullPointerException for a null argument.
        if (s == null)
        {
          return null;
        }

        String result = null;

        try
        {
          // URLEncoder encodes a space as "+" and escapes ! ' ( ) ~, all of
          // which encodeURIComponent handles differently, so undo those.
          result = URLEncoder.encode(s, "UTF-8")
                             .replaceAll("\\+", "%20")
                             .replaceAll("\\%21", "!")
                             .replaceAll("\\%27", "'")
                             .replaceAll("\\%28", "(")
                             .replaceAll("\\%29", ")")
                             .replaceAll("\\%7E", "~");
        }

        // This exception should never occur.
        catch (UnsupportedEncodingException e)
        {
          result = s;
        }

        return result;
      }

      /**
       * Private constructor to prevent this class from being instantiated.
       */
      private EncodingUtil()
      {
        super();
      }
    }
John Topley answered Sep 20 '22 21:09



Looking at the implementation differences, I see that:

MDC on encodeURIComponent():

  • literal characters (regex representation): [-a-zA-Z0-9._*~'()!]

Java 1.5.0 documentation on URLEncoder:

  • literal characters (regex representation): [-a-zA-Z0-9._*]
  • the space character " " is converted into a plus sign "+".
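The difference between the two literal sets can be checked directly. The following is only a small sketch: the class name `LiteralSetDemo` and the helper `urlEncode` are made up for illustration, and it simply prints what `URLEncoder.encode` does with the space and with each character that `encodeURIComponent` would leave alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; the name and helper are not from any library.
public class LiteralSetDemo {

    // Thin wrapper so the checked exception does not clutter the loop below.
    public static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // Space, plus the characters encodeURIComponent keeps literal.
        String[] samples = {" ", "~", "'", "(", ")", "!", "*"};
        for (int i = 0; i < samples.length; i++) {
            System.out.println("[" + samples[i] + "] -> " + urlEncode(samples[i]));
        }
    }
}
```

Only `*` should come back unchanged; the space becomes `+`, and `~ ' ( ) !` come back as `%7E %27 %28 %29 %21`.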

So basically, to get the desired result, use URLEncoder.encode(s, "UTF-8") and then do some post-processing:

  • replace all occurrences of "+" with "%20"
  • replace all occurrences of "%xx" representing any of [~'()!] back to their literal counterparts
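The two post-processing steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name `EncodeUriComponentSketch` is made up, and it assumes Java 5 or later because it uses the literal `String.replace(CharSequence, CharSequence)` rather than the regex-based `replaceAll`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical class name; a sketch of the post-processing described above.
public class EncodeUriComponentSketch {

    public static String encodeURIComponent(String s) {
        if (s == null) {
            return null;
        }
        try {
            return URLEncoder.encode(s, "UTF-8")
                    // Step 1: URLEncoder writes a space as "+". A literal "+"
                    // in the input has already become "%2B", so this is safe.
                    .replace("+", "%20")
                    // Step 2: restore the characters that encodeURIComponent
                    // leaves as literals.
                    .replace("%21", "!")
                    .replace("%27", "'")
                    .replace("%28", "(")
                    .replace("%29", ")")
                    .replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // The torture test string from the question; \u00B1 is the ± sign.
        String tortureTest = "\"A\" B \u00B1 \"";
        System.out.println(encodeURIComponent(tortureTest));
    }
}
```

For the torture test string this should print %22A%22%20B%20%C2%B1%20%22, the same value Firebug reported in the question.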
Tomalak answered Sep 21 '22 21:09
