
How to determine if a String contains invalid encoded characters

Usage scenario

We have implemented a web service that our web frontend developers use internally (via a PHP API) to display product data. On the website the user enters something (e.g. a query string), and the website then calls the service through the API.

Note: We use restlet, not tomcat

Original Problem

Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes the URL accordingly. This results in different query strings for ISO-8859-1 and UTF-8.

Our website forwards the user's input without converting it (which it should), so it may call the web service through the API with a query string that contains German umlauts.

For example, for a query part looking like

    ...v=abcädef 

if "ISO-8859-1" is selected, the sent query part looks like

    ...v=abc%E4def

but if "UTF-8" is selected, the sent query part looks like

    ...v=abc%C3%A4def
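The difference can be reproduced on the JVM with `java.net.URLEncoder`, which percent-encodes the bytes of whichever charset it is given. This is only a minimal sketch; the class name `UmlautEncodingDemo` and the helper `encode` are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UmlautEncodingDemo {

    // Percent-encodes the value using the bytes of the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Both charsets used below are mandatory on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "abcädef", written with an escape so the source file encoding does not matter
        String value = "abc\u00E4def";
        System.out.println(encode(value, "ISO-8859-1")); // abc%E4def
        System.out.println(encode(value, "UTF-8"));      // abc%C3%A4def
    }
}
```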

Desired Solution

As we control the service (we implemented it ourselves), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.

Current Solution In Detail

Check for each character ( == string.substring(i,i+1) )

  1. if character.getBytes()[0] equals 63 for '?'
  2. if Character.getType(character.charAt(0)) returns OTHER_SYMBOL

Code

    protected List< String > getNonUnicodeCharacters( String s ) {
      final List< String > result = new ArrayList< String >();
      for ( int i = 0 , n = s.length() ; i < n ; i++ ) {
        final String character = s.substring( i , i + 1 );
        final boolean isOtherSymbol =
          ( int ) Character.OTHER_SYMBOL
           == Character.getType( character.charAt( 0 ) );
        final boolean isNonUnicode = isOtherSymbol
          && character.getBytes()[ 0 ] == ( byte ) 63;
        if ( isNonUnicode )
          result.add( character );
      }
      return result;
    }

Question

Will this catch all invalid (non-UTF-8 encoded) characters? Does anyone have a better (easier) solution?

Note: I checked URLDecoder with the following code

    final String[] test = new String[]{
      "v=abc%E4def",
      "v=abc%C3%A4def"
    };
    for ( int i = 0 , n = test.length ; i < n ; i++ ) {
        System.out.println( java.net.URLDecoder.decode(test[i],"UTF-8") );
        System.out.println( java.net.URLDecoder.decode(test[i],"ISO-8859-1") );
    }

This prints:

    v=abc?def
    v=abcädef
    v=abcädef
    v=abcÃ¤def

and it does not throw an IllegalArgumentException (sigh).
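The likely reason no exception is thrown: `URLDecoder` turns the percent-escapes into bytes and then builds a `String` from them, and that step replaces malformed input with U+FFFD (which the console prints as `?`) instead of failing. A small sketch to make this visible; the class and method names are illustrative only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class ReplacementCharDemo {

    // Decodes as UTF-8 and reports whether the decoder had to substitute U+FFFD.
    static boolean decodesWithReplacement(String encoded) {
        try {
            String decoded = URLDecoder.decode(encoded, "UTF-8");
            return decoded.indexOf('\uFFFD') >= 0;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(decodesWithReplacement("v=abc%E4def"));    // Latin-1 byte read as UTF-8
        System.out.println(decodesWithReplacement("v=abc%C3%A4def")); // genuine UTF-8
    }
}
```

Searching for U+FFFD is only a heuristic, since a client can legitimately send that character itself as `%EF%BF%BD`.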

Daniel Hiller asked May 20 '09 10:05



2 Answers

I asked the same question:

Handling Character Encoding in URI on Tomcat

I recently found a solution and it works pretty well for me. You might want to give it a try. Here is what you need to do:

  1. Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
  2. If you have to URL-decode manually, use Latin-1 as the charset there as well.
  3. Use the fixEncoding() function to fix up encodings.

For example, to get a parameter from the query string:

  String name = fixEncoding(request.getParameter("name")); 

You can always do this. A string whose encoding is already correct is not changed.

The code is attached. Good luck!

    // Requires: import java.io.UnsupportedEncodingException;

    public static String fixEncoding(String latin1) {
        try {
            byte[] bytes = latin1.getBytes("ISO-8859-1");
            if (!validUTF8(bytes))
                return latin1;
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Impossible, throw unchecked
            throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
        }
    }

    public static boolean validUTF8(byte[] input) {
        int i = 0;
        // Check for BOM
        if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
                && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
            i = 3;
        }

        int end;
        for (int j = input.length; i < j; ++i) {
            int octet = input[i];
            if ((octet & 0x80) == 0) {
                continue; // ASCII
            }

            // Check for UTF-8 leading byte (2-, 3- or 4-byte sequence)
            if ((octet & 0xE0) == 0xC0) {
                end = i + 1;
            } else if ((octet & 0xF0) == 0xE0) {
                end = i + 2;
            } else if ((octet & 0xF8) == 0xF0) {
                end = i + 3;
            } else {
                // Not a valid leading byte
                return false;
            }

            // Truncated sequence: fewer trailing bytes than the leading byte announces
            if (end >= j) {
                return false;
            }

            while (i < end) {
                i++;
                octet = input[i];
                if ((octet & 0xC0) != 0x80) {
                    // Not a valid trailing byte
                    return false;
                }
            }
        }
        return true;
    }
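The same fix-up idea can also be sketched without a hand-written validator, by letting a strict `CharsetDecoder` decide whether the Latin-1 bytes are well-formed UTF-8. This is a variant, not the code above: it assumes Java 7+ (for `StandardCharsets`), assumes the input really came from a Latin-1 decode (only characters up to U+00FF), and the class name `FixEncodingSketch` is invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FixEncodingSketch {

    // Re-reads a Latin-1 decoded string as UTF-8 when its bytes form valid UTF-8;
    // otherwise returns the input unchanged.
    static String fixEncoding(String latin1) {
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return latin1; // not UTF-8, so keep the Latin-1 interpretation
        }
    }

    public static void main(String[] args) {
        // What a Latin-1 container hands over for %C3%A4 and for %E4:
        System.out.println(fixEncoding("abc\u00C3\u00A4def")); // UTF-8 bytes seen as Latin-1
        System.out.println(fixEncoding("abc\u00E4def"));       // already correct Latin-1
    }
}
```

Both calls should yield the same string, "abcädef": the first is re-decoded, the second is left alone.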

EDIT: Your approach doesn't work for various reasons. When there are encoding errors, you can't count on what you are getting from Tomcat. Sometimes you get � or ?. Other times you don't get anything at all and getParameter() returns null. And even if you can check for "?", what happens when your query string contains a valid "?"?

Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1, and the user has no control over that. You need to accept both. Changing your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.

The solution I posted here is not perfect but it's the best one we found so far.

ZZ Coder answered Oct 08 '22 09:10


You can use a CharsetDecoder configured to throw an exception if invalid chars are found:

    CharsetDecoder UTF8Decoder =
        Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);

See CodingErrorAction.REPORT
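A fuller sketch of how such a decoder could be applied to a percent-encoded query part. It assumes Java 7+ and that everything outside the `%XX` escapes is plain ASCII; the class `StrictUtf8Check` and method `isValidUtf8` are hypothetical names, not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {

    // Returns true if the percent-encoded input decodes to well-formed UTF-8.
    static boolean isValidUtf8(String percentEncoded) {
        byte[] raw;
        try {
            // ISO-8859-1 maps every byte to exactly one char, so this round trip
            // recovers the raw bytes behind the %XX escapes.
            raw = URLDecoder.decode(percentEncoded, "ISO-8859-1")
                    .getBytes(StandardCharsets.ISO_8859_1);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw)); // throws on the first invalid sequence
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("v=abc%E4def"));    // false
        System.out.println(isValidUtf8("v=abc%C3%A4def")); // true
    }
}
```

A `false` result is the case where the service could answer with a 4xx status. Note that `URLDecoder.decode` itself throws `IllegalArgumentException` for broken escapes such as a lone `%`, which a caller would probably want to treat the same way.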

ante answered Oct 08 '22 09:10