
How do I detect unicode characters in a Java string?


Suppose I have a string that contains Ü. How would I find all those unicode characters? Should I test for their code? How would I do that?

For example, given the string "AÜXÜ", I'd like to transform it to "AYXY". I'd like to do the same for other unicode characters, and I would hate to have to store them in a translation map of some sort.

Geo asked Nov 04 '09 12:11


2 Answers

You could loop through your string and, for every character c, check which Unicode block it belongs to:

    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        // replace with Y
    }
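To make the loop concrete, here is a minimal sketch built around that check. The class and method names (NonLatinReplacer, replaceNonBasicLatin) are made up for illustration, and it assumes Java 8 or later for String.codePoints(). It walks the string by code point rather than by char, so a character stored as a surrogate pair is replaced once rather than twice.

```java
public class NonLatinReplacer {

    /** Replaces every code point outside the Basic Latin block (U+0000..U+007F). */
    public static String replaceNonBasicLatin(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields one int per character, even for surrogate pairs.
        input.codePoints().forEach(cp -> {
            if (Character.UnicodeBlock.of(cp) != Character.UnicodeBlock.BASIC_LATIN) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00DC is Ü, written as an escape so the source encoding does not matter.
        System.out.println(replaceNonBasicLatin("A\u00DCX\u00DC", 'Y')); // prints AYXY
    }
}
```

Note that this replaces every non-Basic-Latin character with the same letter, which matches the "AÜXÜ" example but is probably too blunt for real text.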
jitter answered Sep 24 '22 12:09


The definition of "unicode characters" is vague, but will be taken here to mean Unicode characters not covered by the standard ISO 8859 charset. If this is true in your case, then loop through all characters in the String and test each code point to determine whether it falls within the given character set.
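One way to run that test without hard-coding code point ranges is to ask a CharsetEncoder whether it can encode the text. This is only a sketch: the class and method names (CharsetCheck, isAscii, isLatin1) are hypothetical, and it assumes ISO-8859-1 is the ISO 8859 variant you care about.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {

    /** True if every character of the input can be encoded as US-ASCII. */
    public static boolean isAscii(String input) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(input);
    }

    /** True if every character of the input can be encoded as ISO-8859-1. */
    public static boolean isLatin1(String input) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(input);
    }

    public static void main(String[] args) {
        String text = "A\u00DCX\u00DC"; // \u00DC is Ü
        System.out.println(isAscii(text));  // prints false
        System.out.println(isLatin1(text)); // prints true
    }
}
```

Be aware that Ü (U+00DC) is itself part of ISO-8859-1, so a check against that charset will not flag it. If the goal is to catch Ü, test against US-ASCII instead.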

Alternatively, use a Map<Character, Character> and replace the characters in the string that match the map's keys. For example:

    import java.util.HashMap;
    import java.util.Map;

    // Maps each character to be replaced onto its replacement.
    Map<Character, Character> charReplacementMap = new HashMap<>();
    charReplacementMap.put('Ü', 'Y');
    // Put more here.

    String originalString = "AÜAÜ";
    StringBuilder builder = new StringBuilder();

    for (char currentChar : originalString.toCharArray()) {
        // Use the mapped replacement if there is one, else keep the character.
        Character replacementChar = charReplacementMap.get(currentChar);
        builder.append(replacementChar != null ? replacementChar : currentChar);
    }

    String newString = builder.toString();

Or, do you mean "all characters with diacritics"? If so, then use java.text.Normalizer to remove diacritical marks:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    /**
     * Remove any diacritical marks (accents like ç, ñ, é, etc) from
     * the given string (so that it returns plain c, n, e, etc).
     * @param string The string to remove diacritical marks from.
     * @return The string with removed diacritical marks, if any.
     */
    public static String removeDiacriticalMarks(String string) {
        return Normalizer.normalize(string, Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
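As a quick check of what this method does, here is the same code wrapped in a runnable class. The class name DiacriticsDemo is only for illustration. NFD splits each accented letter into a base letter plus a combining mark, and the regex then deletes the marks.

```java
import java.text.Normalizer;

public class DiacriticsDemo {

    /** Decomposes the string (NFD) and strips the combining diacritical marks. */
    public static String removeDiacriticalMarks(String string) {
        return Normalizer.normalize(string, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // \u00DC is Ü; \u00E7, \u00F1, \u00E9 are ç, ñ, é.
        System.out.println(removeDiacriticalMarks("A\u00DCX\u00DC"));     // prints AUXU
        System.out.println(removeDiacriticalMarks("\u00E7\u00F1\u00E9")); // prints cne
    }
}
```

Letters that have no canonical decomposition, such as ø or ß, pass through unchanged, so this is not a general "convert to ASCII" routine.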

One pitfall: Ü would become U, not Y. Not sure if that's what you're after. If you want to replace each character by the way it is pronounced, you'll really need to create a mapping. Sure, it's tedious work, but it's done in less time than you needed to follow this topic.
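If you want both behaviours, one possible arrangement is to apply a small override map first and let the Normalizer handle everything else. This is a sketch, not a complete transliteration scheme: the class name Transliterator and the OVERRIDES table are hypothetical, and the only entry is the Ü to Y rule from the question.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class Transliterator {

    // Hypothetical override table: characters that should not simply lose their accent.
    private static final Map<Character, Character> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put('\u00DC', 'Y'); // Ü -> Y, as in the question
    }

    public static String transliterate(String input) {
        // Compose first (NFC) so that "U" + combining diaeresis is seen as a single Ü.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder mapped = new StringBuilder(composed.length());
        for (char c : composed.toCharArray()) {
            Character override = OVERRIDES.get(c);
            mapped.append(override != null ? override : c);
        }
        // Everything without an override just loses its diacritical marks.
        return Normalizer.normalize(mapped.toString(), Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(transliterate("A\u00DCX\u00E9")); // prints AYXe
    }
}
```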

BalusC answered Sep 24 '22 12:09