What is the best and most efficient way to filter out all Unicode punctuation characters and symbols such as ✀ ✁ ✂ ✃ ✄ ✅ ✆ ✇ ✈ from a String? Simply removing every character that is not in a-z, A-Z and 0-9 is not an option, because I want to keep letters from other languages (ą, ę, ó etc.). Thanks in advance.
You could use \p{L} to match all Unicode letters. Example:
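public static void main(String[] args) {
    String[] test = {"asdEWR1", "ąęóöòæûùÜ", "sd,", "✀","✁","✂","✃","✄","✅","✆","✇","✈"};
    // Remove everything that is neither a Unicode letter (\p{L}) nor a digit (\d).
    // Only the first ^ in a character class negates it; a second one would be a literal caret.
    for (String s : test)
        System.out.println(s + " => " + s.replaceAll("[^\\p{L}\\d]", ""));
}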
outputs:
asdEWR1 => asdEWR1
ąęóöòæûùÜ => ąęóöòæûùÜ
sd, => sd
✀ =>
✁ =>
✂ =>
✃ =>
✄ =>
✅ =>
✆ =>
✇ =>
✈ =>
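One caveat with \p{L}: combining marks are not letters. If the input may contain decomposed text (a base letter followed by a separate combining accent), a letters-only class strips the accent. The sketch below shows one possible way around that, by also allowing \p{M}. The class and method names are hypothetical, and whether you want to keep marks depends on your data.

```java
final class MarkAwareFilter {
    // Keeps letters and ASCII digits only; combining marks are removed.
    static String lettersOnly(String s) {
        return s.replaceAll("[^\\p{L}\\d]", "");
    }

    // Also keeps combining marks (category M), so decomposed accents survive.
    static String lettersAndMarks(String s) {
        return s.replaceAll("[^\\p{L}\\p{M}\\d]", "");
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent), then a comma.
        String decomposed = "e\u0301,";
        System.out.println(lettersOnly(decomposed).length());     // 1
        System.out.println(lettersAndMarks(decomposed).length()); // 2
    }
}
```

Alternatively, normalizing the input to NFC with java.text.Normalizer before filtering composes most such pairs into single letters.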
Try combining Unicode binary properties:
String fixed = value.replaceAll("[^\\p{IsAlphabetic}\\p{IsDigit}]", "");
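On the efficiency part of the question: String.replaceAll compiles the regex on every call, so if you filter many strings it is usually worth compiling the pattern once and reusing it. A minimal sketch, assuming you want to keep Unicode letters and Unicode decimal digits (\p{Nd}); the class and method names are made up for illustration, and the non-ASCII test characters are written as escapes so the source stays ASCII:

```java
import java.util.regex.Pattern;

final class SymbolFilter {
    // Compiled once and reused across calls.
    private static final Pattern NOT_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{L}\\p{Nd}]");

    // Removes every character that is neither a letter nor a decimal digit.
    static String keepLettersAndDigits(String input) {
        return NOT_LETTER_OR_DIGIT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+0105 and U+0119 are Polish letters, U+2702 is the scissors symbol.
        String sample = "sd, \u0105\u0119 \u2702 42";
        // Letters and digits stay; comma, spaces and the symbol are removed.
        System.out.println(keepLettersAndDigits(sample).length()); // 6
    }
}
```

A Pattern is immutable and safe to share between threads, so a static final field like this is fine.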