
Java remove non Latin-basic characters from string

Let's say I have the following code:

String description = "★★★★★  ♫ ♬ This description ✔✔  ▬ █ ✖  is a mess. ♫ ♬ ★★★★★";

I'd like to remove the non-Latin characters: ★, ♫, ♬, ✔, ▬, █ and ✖.

And have it become this: This description is a mess.

I know there are probably tons of these Wingdings-like characters, so instead of specifying what I'd like to remove, I think it's better to list what I want to keep: Basic Latin and Latin-1 Supplement characters.

I found that I can use the following code to remove everything but the Basic Latin characters:

String clean_description = description.replaceAll("[^\\x00-\\x7F]", "").trim();

But is there a way to also preserve the Latin-1 supplement characters?

RoboticR asked Mar 16 '16 14:03



1 Answer

Looking at the character ranges you provided, "Basic Latin" and "Latin-1 Supplement" are adjacent blocks (0x00-0x7F and 0x80-0xFF).
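As a quick check of that adjacency, here is a minimal sketch using the standard library's Character.UnicodeBlock.of. The class name LatinBlocks and the helper isLatin1 are made up for illustration and are not part of the question.

```java
class LatinBlocks {
    // Hypothetical helper: true when the code point lies in Basic Latin
    // or Latin-1 Supplement, i.e. in the range 0x00-0xFF.
    static boolean isLatin1(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.BASIC_LATIN
            || block == Character.UnicodeBlock.LATIN_1_SUPPLEMENT;
    }

    public static void main(String[] args) {
        // The two blocks meet at 0x7F / 0x80, and the next block starts at 0x100.
        System.out.println(Character.UnicodeBlock.of(0x7F));  // BASIC_LATIN
        System.out.println(Character.UnicodeBlock.of(0x80));  // LATIN_1_SUPPLEMENT
        System.out.println(Character.UnicodeBlock.of(0xFF));  // LATIN_1_SUPPLEMENT
        System.out.println(Character.UnicodeBlock.of(0x100)); // LATIN_EXTENDED_A
        System.out.println(isLatin1(0x2605));                 // false (0x2605 is the black star)
    }
}
```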

So you can use the same regex you provided, just extended out to include the "Latin-1 Supplement" characters. That would look like this:

String clean_description = description.replaceAll("[^\\x00-\\xFF]", "").trim();

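As Quinn pointed out in the comments, this does not remove the spaces that sat between the deleted characters, so the result contains runs of extra spaces (which may or may not be what you want). If you want those spaces removed as well, Quinn's regex may work for you (reproduced here in case the comment is deleted): [^(\\x00-\\xFF)]+(?:$|\\s*). The parentheses inside the character class are literal and redundant, since both already fall within \\x00-\\xFF.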
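To see the effect end to end, here is a minimal sketch that applies the Latin-1 regex and then collapses the leftover whitespace in a second pass. This is a two-step variant, not Quinn's single regex, and the LatinCleaner class and keepLatin1 method are hypothetical names. The sample string is the one from the question, written with Unicode escapes so the result does not depend on the source file's encoding.

```java
class LatinCleaner {
    // Hypothetical helper: keeps only Basic Latin and Latin-1 Supplement
    // characters, then squeezes the whitespace left behind.
    static String keepLatin1(String input) {
        return input
            .replaceAll("[^\\x00-\\xFF]", "") // delete everything outside 0x00-0xFF
            .replaceAll("\\s+", " ")          // collapse each whitespace run to one space
            .trim();                          // drop leading and trailing whitespace
    }

    public static void main(String[] args) {
        // Same text as the question's string; the escapes are the star, note,
        // check mark, bar, block and cross symbols.
        String description =
            "\u2605\u2605\u2605\u2605\u2605  \u266B \u266C This description "
            + "\u2714\u2714  \u25AC \u2588 \u2716  is a mess. \u266B \u266C "
            + "\u2605\u2605\u2605\u2605\u2605";
        System.out.println(keepLatin1(description)); // This description is a mess.
    }
}
```

Note that the second pass also collapses whitespace runs that were already present in the kept text and turns tabs or newlines into single spaces, so skip it if the original spacing matters.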

resueman answered Oct 16 '22 15:10