Let's say I have the following code:
String description = "★★★★★ ♫ ♬ This description ✔✔ ▬ █ ✖ is a mess. ♫ ♬ ★★★★★";
I'd like to remove the non-Latin characters: ✔, ▬, █, ✖, ♫, ♬ and ★.
And have it become this: This description is a mess.
I know there are probably tons of these Wingdings-like characters, so instead of specifying what I'd like to remove, I think it's better to list what I want to keep: Basic Latin and Latin-1 Supplement characters.
I found that I can use the following code to remove everything but the Basic Latin characters:
String clean_description = description.replaceAll("[^\\x00-\\x7F]", "").trim();
But is there a way to also preserve the Latin-1 supplement characters?
To remove the non-ASCII characters from a string in Python: use the str.encode() method to encode the string with the ASCII encoding and errors set to 'ignore', then decode the result back to a string.
Removing all non-digit characters: for a given String, calling replaceAll with the \D pattern and an empty replacement removes every non-digit character. Nothing is inserted in place of the removed characters.
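As a small illustration of the \D pattern described above, here is a minimal sketch. The class name DigitsOnly, the helper digitsOnly and the sample phone number are made up for the example. Note that, without the UNICODE_CHARACTER_CLASS flag, Java treats \D as "anything other than 0-9", so digits from other scripts are removed as well.

```java
public class DigitsOnly {
    // Hypothetical helper: strips everything that is not an ASCII digit 0-9.
    static String digitsOnly(String input) {
        return input.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        // Spaces, punctuation and letters are all dropped.
        System.out.println(digitsOnly("Phone: +1 (555) 010-9999")); // prints 15550109999
    }
}
```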
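Calling replaceAll("\\p{C}", "?") replaces every non-printable character with a question mark. Here \p{C} is the Unicode "Other" category, which covers the invisible control characters, format characters, surrogates, private-use characters and unassigned code points.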
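A quick sketch of the \p{C} replacement, assuming you want a visible placeholder for each non-printable character. The class and method names (ControlCharFilter, maskControlChars) are invented for the example. Keep in mind that tab and newline are control characters too, so they are replaced along with everything else in the category.

```java
public class ControlCharFilter {
    // Hypothetical helper: swaps every character in Unicode category C for '?'.
    static String maskControlChars(String input) {
        return input.replaceAll("\\p{C}", "?");
    }

    public static void main(String[] args) {
        // \u0007 is the BEL control character; \t is a tab (also a control character).
        System.out.println(maskControlChars("a\u0007b\tc")); // prints a?b?c
    }
}
```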
From looking at the character ranges you provided, it appears that "Basic Latin" and "Latin-1 Supplement" are adjacent (0x00-0x7F and 0x80-0xFF).
So you can use the same regex you provided, just extended out to include the "Latin-1 Supplement" characters. That would look like this:
String clean_description = description.replaceAll("[^\\x00-\\xFF]", "").trim();
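To check that the wider range really keeps Latin-1 Supplement characters, here is a minimal runnable sketch. The class name Latin1Filter, the helper keepLatin1 and the sample text with "Café" are assumptions made for the demonstration. The symbols are written as Unicode escapes only so that the example does not depend on the source file encoding.

```java
public class Latin1Filter {
    // Hypothetical helper: removes every character outside U+0000..U+00FF
    // (Basic Latin plus Latin-1 Supplement), then trims the ends.
    static String keepLatin1(String input) {
        return input.replaceAll("[^\\x00-\\xFF]", "").trim();
    }

    public static void main(String[] args) {
        // \u2605 = black star, \u2714 = check mark, \u266B = beamed notes,
        // \u00E9 = e with acute accent (a Latin-1 Supplement character).
        String sample = "\u2605 Caf\u00E9 \u2714 ok \u266B";
        // The accented letter survives; the symbols are removed, but the
        // spaces that surrounded them stay behind (two spaces before "ok").
        System.out.println("[" + keepLatin1(sample) + "]");
    }
}
```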
As Quinn pointed out in the comments, this does not remove the spaces that surrounded the deleted characters, so the result contains runs of extra spaces (which may or may not be what you want). If you want those spaces removed as well, Quinn's regex, [^(\\x00-\\xFF)]+(?:$|\\s*) (reproduced here in case the comment is deleted), may work for you.
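If a single regex feels hard to read, one alternative is to do it in two passes: strip the unwanted characters first, then collapse the leftover whitespace. This is only a sketch of that idea, not Quinn's approach; the class name WhitespaceCollapse and the helper cleanAndCollapse are made up for the example. One trade-off to be aware of: the second pass also collapses any runs of whitespace (including newlines) that were already in the original text.

```java
public class WhitespaceCollapse {
    // Hypothetical helper: keeps only U+0000..U+00FF, then squeezes every
    // run of whitespace down to a single space and trims the ends.
    static String cleanAndCollapse(String input) {
        return input
                .replaceAll("[^\\x00-\\xFF]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        // Same description as in the question, written with Unicode escapes
        // so the example does not depend on the source file encoding.
        String description = "\u2605\u2605\u2605\u2605\u2605 \u266B \u266C This description "
                + "\u2714\u2714 \u25AC \u2588 \u2716 is a mess. \u266B \u266C "
                + "\u2605\u2605\u2605\u2605\u2605";
        System.out.println(cleanAndCollapse(description)); // prints This description is a mess.
    }
}
```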