
Filtering out UTF-8 punctuation and symbols from a String

Tags: java, regex, utf-8

What is the best and most efficient way to filter out all UTF-8 punctuation characters and symbols (such as ✀ ✁ ✂ ✃ ✄ ✅ ✆ ✇ ✈) from a String? Simply removing every character that is not in a-z, A-Z or 0-9 is not an option, because I want to keep letters from other languages (ą, ę, ó, etc.). Thanks in advance.

user1315305 asked Mar 23 '23 23:03



2 Answers

You could use \p{L}, which matches any Unicode letter, together with \d for digits. Example:

public class Main {
    public static void main(String[] args) {
        String[] test = {"asdEWR1", "ąęóöòæûùÜ", "sd,", "✀","✁","✂","✃","✄","✅","✆","✇","✈"};
        // Remove everything that is neither a Unicode letter (\p{L}) nor an ASCII digit (\d).
        for (String s : test) {
            System.out.println(s + " => " + s.replaceAll("[^\\p{L}\\d]", ""));
        }
    }
}

outputs:

asdEWR1 => asdEWR1
ąęóöòæûùÜ => ąęóöòæûùÜ
sd, => sd
✀ => 
✁ => 
✂ => 
✃ => 
✄ => 
✅ => 
✆ => 
✇ => 
✈ => 
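
The pattern above is an allow-list: it keeps letters and digits and drops everything else, including spaces. If the goal is only to drop punctuation and symbols, a deny-list built on the Unicode general categories P (punctuation) and S (symbol) may be closer to what the question asks. The following is a minimal sketch under that assumption; the class name SymbolFilter and the method stripPunctuationAndSymbols are hypothetical names chosen for illustration, and the test characters are written as \u escapes so the source stays ASCII-only.

```java
import java.util.regex.Pattern;

public class SymbolFilter {
    // Matches any code point whose general category is P (punctuation) or S (symbol).
    // Compiled once so repeated calls do not re-parse the regex.
    private static final Pattern PUNCT_OR_SYMBOL = Pattern.compile("[\\p{P}\\p{S}]");

    // Hypothetical helper: removes punctuation and symbols, keeps everything else
    // (letters, digits, whitespace).
    public static String stripPunctuationAndSymbols(String input) {
        return PUNCT_OR_SYMBOL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains a comma, three Polish letters, U+2708 AIRPLANE and an exclamation mark.
        String sample = "sd, \u0105\u0119\u00f3 \u2708 ok!";
        System.out.println(stripPunctuationAndSymbols(sample));
    }
}
```

Note that the S category also covers characters such as +, $ and ^, so this variant removes them as well. Whether that is desirable depends on the input.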
assylias answered Apr 13 '23 20:04



Try a combination of Unicode binary properties:

String fixed = value.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
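
The two answers do not keep exactly the same characters. By default \d in Java matches only the ASCII digits 0-9, while \p{IsDigit} matches any Unicode decimal digit, and \p{IsAlphabetic} covers somewhat more than \p{L} (for example letter numbers such as Roman numerals). The sketch below illustrates the difference; the class and method names are hypothetical, and the sample characters are written as \u escapes to keep the source ASCII-only.

```java
public class LetterDigitFilters {
    // Keeps Unicode letters and ASCII digits only; without the
    // UNICODE_CHARACTER_CLASS flag, \d is equivalent to [0-9].
    public static String keepLettersAsciiDigits(String s) {
        return s.replaceAll("[^\\p{L}\\d]", "");
    }

    // Keeps code points with the Alphabetic or Digit binary property.
    public static String keepAlphabeticAndDigits(String s) {
        return s.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
    }

    public static void main(String[] args) {
        // a, 1, ARABIC-INDIC DIGIT THREE, ROMAN NUMERAL EIGHT, AIRPLANE
        String sample = "a1\u0663\u2167\u2708";
        System.out.println(keepLettersAsciiDigits(sample));   // keeps only "a1"
        System.out.println(keepAlphabeticAndDigits(sample));  // also keeps the Arabic-Indic digit and the Roman numeral
    }
}
```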
rolfl answered Apr 13 '23 20:04
