How can I match characters (with the intention of removing them) from outside the Unicode Basic Multilingual Plane in Java?
To remove all non-BMP characters, the following should work, because Java's regex engine matches by code point rather than by UTF-16 code unit:
String sanitizedString = inputString.replaceAll("[^\u0000-\uFFFF]", "");
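For anyone who wants to try this in isolation, here is a minimal sketch that wraps the same character class in a reusable method. The class name `NonBmpRegexDemo` and the method name `stripNonBmp` are my own, not from any library, and the pattern uses the regex-level escape form (double backslash) instead of the compiler-level one. Both forms should behave the same.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class NonBmpRegexDemo {
    // Matches any code point above U+FFFF. The regex engine treats a
    // surrogate pair as one character, so the pair is removed as a unit.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    static String stripNonBmp(String input) {
        return NON_BMP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F600 (an emoji) built from its code point, between two BMP letters.
        String input = new StringBuilder("a").appendCodePoint(0x1F600).append("b").toString();
        System.out.println(input.length());      // prints 4 (two chars for the pair)
        System.out.println(stripNonBmp(input));  // prints "ab"
    }
}
```

Precompiling the pattern is only a convenience if the method is called often; `replaceAll` on the string itself does the same work per call.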
Are you looking for specific characters or all characters outside the BMP?
If the former, you can use a StringBuilder to construct a string containing code points from the higher planes, and regex will work as expected:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// U+10300 (OLD ITALIC LETTER A) lies outside the BMP.
String test = new StringBuilder().append("test").appendCodePoint(0x10300).append("test").toString();
Pattern regex = Pattern.compile(new StringBuilder().appendCodePoint(0x10300).toString());
Matcher matcher = regex.matcher(test);
if (matcher.find()) {
    System.out.println(matcher.start()); // prints 4
}
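One detail worth knowing when you use the match position: `start()` and `end()` are UTF-16 `char` offsets, not code point counts, so a single supplementary character spans two index positions. A small sketch of that, assuming a made-up helper `findCodePoint` in a made-up class `SupplementaryMatchDemo`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SupplementaryMatchDemo {
    // Returns {start, end} of the first occurrence of the code point,
    // measured in UTF-16 chars, or null if it does not occur.
    static int[] findCodePoint(String text, int codePoint) {
        String literal = new StringBuilder().appendCodePoint(codePoint).toString();
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        String test = new StringBuilder("test").appendCodePoint(0x10300).append("test").toString();
        int[] span = findCodePoint(test, 0x10300);
        System.out.println(span[0] + ".." + span[1]); // prints 4..6
    }
}
```

The match is two chars wide because U+10300 is stored as a surrogate pair.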
If you're looking to remove all non-BMP characters from a string, then I'd use StringBuilder directly rather than regex:
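StringBuilder sb = new StringBuilder(test.length());
for (int ii = 0 ; ii < test.length() ; )
{
    int codePoint = test.codePointAt(ii);
    if (codePoint > 0xFFFF)
    {
        // Supplementary code point: skip both chars of the surrogate pair.
        ii += Character.charCount(codePoint);
    }
    else
    {
        // BMP code point: always a single char, so keep it and advance by one.
        sb.appendCodePoint(codePoint);
        ii++;
    }
}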
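If you are on Java 8 or later, the same filtering can be sketched with the `codePoints()` stream. This is an alternative to the loop above, not a replacement for it, and the names `BmpFilter` and `keepBmpOnly` are made up for the example.

```java
// Hypothetical helper class, for illustration only. Assumes Java 8+.
final class BmpFilter {
    // Keeps only code points in the BMP (U+0000 to U+FFFF).
    // Unpaired surrogates count as BMP code points here and are kept,
    // which matches the behaviour of the explicit loop.
    static String keepBmpOnly(String input) {
        return input.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        String test = new StringBuilder("test").appendCodePoint(0x10300).append("test").toString();
        System.out.println(keepBmpOnly(test)); // prints "testtest"
    }
}
```

The explicit loop avoids the stream overhead and may be preferable for very large strings; the stream version is mainly shorter to read.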