Usage scenario
We have implemented a webservice that our web frontend developers use (via a PHP API) internally to display product data. On the website the user enters something (i.e. a query string). Internally the website makes a call to the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL according to that encoding. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should do), so it may end up calling the webservice via the API with a query string that contains German umlauts.
E.g. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
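For reference, a minimal sketch (my own addition, not from the original post) that reproduces the two encodings with java.net.URLEncoder:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UmlautEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'ä' is the single byte 0xE4 in ISO-8859-1 ...
        System.out.println(URLEncoder.encode("abcädef", "ISO-8859-1")); // abc%E4def
        // ... but the two-byte sequence 0xC3 0xA4 in UTF-8
        System.out.println(URLEncoder.encode("abcädef", "UTF-8"));      // abc%C3%A4def
    }
}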
Desired Solution
As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
Check each character (i.e. string.substring(i, i + 1)):
Code
protected List<String> getNonUnicodeCharacters(String s) {
    final List<String> result = new ArrayList<String>();
    for (int i = 0, n = s.length(); i < n; i++) {
        final String character = s.substring(i, i + 1);
        // Character.OTHER_SYMBOL is the general category that the
        // replacement character U+FFFD falls into.
        final boolean isOtherSymbol =
                (int) Character.OTHER_SYMBOL == Character.getType(character.charAt(0));
        // 63 is '?', which getBytes() (platform default charset)
        // substitutes for characters it cannot represent.
        final boolean isNonUnicode = isOtherSymbol
                && character.getBytes()[0] == (byte) 63;
        if (isNonUnicode)
            result.add(character);
    }
    return result;
}
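A hypothetical call site for this check (decodedQueryValue stands in for the already-decoded parameter value; both names are my own, for illustration):

List<String> suspicious = getNonUnicodeCharacters(decodedQueryValue);
if (!suspicious.isEmpty()) {
    // respond with a 4xx status here
}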
Question
Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
        "v=abc%E4def",
        "v=abc%C3%A4def"
};
for (int i = 0, n = test.length; i < n; i++) {
    System.out.println(java.net.URLDecoder.decode(test[i], "UTF-8"));
    System.out.println(java.net.URLDecoder.decode(test[i], "ISO-8859-1"));
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException sigh
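Worth noting (my own observation, not from the original post): the '?' printed above is actually the Unicode replacement character U+FFFD, which URLDecoder substitutes for malformed byte sequences instead of throwing. That gives a usable heuristic, with the caveat that a client could legitimately send a percent-encoded U+FFFD:

// Sketch: URLDecoder replaces malformed UTF-8 with U+FFFD rather than
// throwing, so its presence signals a likely mis-encoded query string.
String decoded = java.net.URLDecoder.decode("v=abc%E4def", "UTF-8");
boolean looksMisencoded = decoded.indexOf('\uFFFD') >= 0; // true here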
I asked the same question,
Handling Character Encoding in URI on Tomcat
I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do.
For example, to get a parameter from the query string:
String name = fixEncoding(request.getParameter("name"));
You can always do this; a string with the correct encoding is not changed.
The code is attached. Good luck!
public static String fixEncoding(String latin1) {
    try {
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3
            && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB
            && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Five- and six-byte forms are not legal UTF-8
            return false;
        }

        if (end >= input.length) {
            return false; // Truncated multi-byte sequence
        }

        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
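A quick demonstration (my own, not part of the original answer) on the two decoded values from the question above:

// Mis-decoded UTF-8 (the Latin-1 view of the bytes 0xC3 0xA4) is repaired:
System.out.println(fixEncoding("abcÃ¤def")); // prints abcädef
// Genuine Latin-1 input is valid on its own and round-trips unchanged:
System.out.println(fixEncoding("abcädef"));  // prints abcädef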
EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a legitimate "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that, so you need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, giving us a chance to fix them up or throw them away.
The solution I posted here is not perfect but it's the best one we found so far.
You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder utf8Decoder = Charset.forName("UTF-8")
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT);
See CodingErrorAction.REPORT
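Putting it together, a minimal sketch (my own; the class and method names are assumptions, and it presupposes you can get at the raw percent-decoded bytes of the query string) that maps a malformed sequence to the desired 4xx response:

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public final class Utf8Validator {

    // Returns true if the raw bytes form valid UTF-8; a false result
    // is the trigger for an HTTP 400 Bad Request in the service.
    public static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The Latin-1 byte 0xE4 is not a valid UTF-8 sequence:
        System.out.println(isValidUtf8("abc\u00e4def".getBytes("ISO-8859-1"))); // false
        // The UTF-8 bytes 0xC3 0xA4 are:
        System.out.println(isValidUtf8("abc\u00e4def".getBytes("UTF-8")));      // true
    }
}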