Usage scenario
We have implemented a webservice that our web frontend developers use (via a PHP API) internally to display product data. On the website the user enters something (i.e. a query string). Internally the website makes a call to the service via the API.
Note: We use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes a URL according to that encoding. This results in different query strings for ISO-8859-1 and UTF-8.
Our website forwards the user's input without converting it (which it should do), so it may end up calling the webservice via the API with a query string that contains German umlauts.
E.g. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
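For reference, a minimal sketch (my own addition, not from the original post) that reproduces the two encodings with java.net.URLEncoder:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UmlautEncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 'ä' is the single byte 0xE4 in ISO-8859-1 ...
        System.out.println(URLEncoder.encode("abcädef", "ISO-8859-1")); // abc%E4def
        // ... but the two-byte sequence 0xC3 0xA4 in UTF-8
        System.out.println(URLEncoder.encode("abcädef", "UTF-8"));      // abc%C3%A4def
    }
}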
Desired Solution
As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
Check each character (i.e. string.substring(i, i + 1)):
Code
protected List<String> getNonUnicodeCharacters(String s) {
    final List<String> result = new ArrayList<String>();
    for (int i = 0, n = s.length(); i < n; i++) {
        final String character = s.substring(i, i + 1);
        // Character.OTHER_SYMBOL is the general category that the
        // replacement character U+FFFD falls into.
        final boolean isOtherSymbol =
                (int) Character.OTHER_SYMBOL == Character.getType(character.charAt(0));
        // 63 is '?', which getBytes() (platform default charset)
        // substitutes for characters it cannot represent.
        final boolean isNonUnicode = isOtherSymbol
                && character.getBytes()[0] == (byte) 63;
        if (isNonUnicode)
            result.add(character);
    }
    return result;
}
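A hypothetical call site for this check (decodedQueryValue stands in for the already-decoded parameter value; both names are my own, for illustration):

List<String> suspicious = getNonUnicodeCharacters(decodedQueryValue);
if (!suspicious.isEmpty()) {
    // respond with a 4xx status here
}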
Question
Will this catch all invalid (non-UTF-8-encoded) characters? Does anyone have a better (easier) solution?
Note: I checked URLDecoder with the following code
final String[] test = new String[]{
        "v=abc%E4def",
        "v=abc%C3%A4def"
};
for (int i = 0, n = test.length; i < n; i++) {
    System.out.println(java.net.URLDecoder.decode(test[i], "UTF-8"));
    System.out.println(java.net.URLDecoder.decode(test[i], "ISO-8859-1"));
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcÃ¤def
and it does not throw an IllegalArgumentException sigh
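Worth noting (my own observation, not from the original post): the '?' printed above is actually the Unicode replacement character U+FFFD, which URLDecoder substitutes for malformed byte sequences instead of throwing. That gives a usable heuristic, with the caveat that a client could legitimately send a percent-encoded U+FFFD:

// Sketch: URLDecoder replaces malformed UTF-8 with U+FFFD rather than
// throwing, so its presence signals a likely mis-encoded query string.
String decoded = java.net.URLDecoder.decode("v=abc%E4def", "UTF-8");
boolean looksMisencoded = decoded.indexOf('\uFFFD') >= 0; // true here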
I asked the same question,
Handling Character Encoding in URI on Tomcat
I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do.
For example, to get a parameter from the query string:
String name = fixEncoding(request.getParameter("name"));
You can always do this; a string with the correct encoding is not changed.
The code is attached. Good luck!
public static String fixEncoding(String latin1) {
    try {
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3
            && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB
            && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Five- and six-byte forms are not legal UTF-8
            return false;
        }

        if (end >= input.length) {
            return false; // Truncated multi-byte sequence
        }

        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
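A quick demonstration (my own, not part of the original answer) on the two decoded values from the question above:

// Mis-decoded UTF-8 (the Latin-1 view of the bytes 0xC3 0xA4) is repaired:
System.out.println(fixEncoding("abcÃ¤def")); // prints abcädef
// Genuine Latin-1 input is valid on its own and round-trips unchanged:
System.out.println(fixEncoding("abcädef"));  // prints abcädef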
EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you get from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. And even if you could check for "?", what happens when your query string contains a legitimate "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that, so you need to accept both. Changing your servlet to Latin-1 will preserve all the characters, even if they are wrong, giving us a chance to fix them up or throw them away.
The solution I posted here is not perfect but it's the best one we found so far.
You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder utf8Decoder = Charset.forName("UTF-8")
        .newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT);
See CodingErrorAction.REPORT
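Putting it together, a minimal sketch (my own; the class and method names are assumptions, and it presupposes you can get at the raw percent-decoded bytes of the query string) that maps a malformed sequence to the desired 4xx response:

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public final class Utf8Validator {

    // Returns true if the raw bytes form valid UTF-8; a false result
    // is the trigger for an HTTP 400 Bad Request in the service.
    public static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The Latin-1 byte 0xE4 is not a valid UTF-8 sequence:
        System.out.println(isValidUtf8("abc\u00e4def".getBytes("ISO-8859-1"))); // false
        // The UTF-8 bytes 0xC3 0xA4 are:
        System.out.println(isValidUtf8("abc\u00e4def".getBytes("UTF-8")));      // true
    }
}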