
Remove all non-"word characters" from a String in Java, leaving accented characters?

Tags: java, string, regex

Apparently Java's regex flavor counts umlauts and other accented characters as non-"word characters" when I use \W.

        "TESTÜTEST".replaceAll( "\\W", "" ) 

returns "TESTTEST" for me. What I want is to remove only the characters that truly are non-"word characters". Is there any way to do this without resorting to something along the lines of

         "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]" 

only to realize I forgot ô?

Asked by Epaga, Oct 23 '09 08:10



1 Answer

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", ""); 
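As a rough, self-contained sketch of the line above applied to the string from the question: the class name `WordCharFilter` and the helper `stripNonWordChars` are made up for illustration, and the non-ASCII characters are written as `\u` escapes only to sidestep source-file encoding issues.

```java
import java.util.regex.Pattern;

class WordCharFilter {
    // Anything that is neither a Unicode letter (\p{L}) nor a decimal digit (\p{Nd}).
    private static final Pattern NON_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{L}\\p{Nd}]+");

    // Hypothetical helper: removes every run of non-letter, non-digit characters.
    static String stripNonWordChars(String input) {
        return NON_LETTER_OR_DIGIT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String subject = "TEST\u00DCTEST"; // "TESTÜTEST"

        // Default \W treats the umlaut as a non-word character and drops it.
        System.out.println(subject.replaceAll("\\W", ""));

        // The Unicode-category version keeps the umlaut.
        System.out.println(stripNonWordChars(subject));

        // Punctuation and spaces go; accented letters and digits stay.
        System.out.println(stripNonWordChars("Cr\u00E8me br\u00FBl\u00E9e, 2 pieces!"));
    }
}
```

One difference from \W worth keeping in mind: \w counts the underscore as a word character, whereas this character class removes it, since `_` is neither a letter nor a digit.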

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols like ¼; the latter doesn't. See it on regex101.com.
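To make the difference between \p{N} and \p{Nd} concrete, here is a small sketch; the class and method names are invented for this example, and `\u00BC` is the ¼ character mentioned above.

```java
class NumberClassDemo {
    // Keeps letters and any Unicode number, including symbols such as the fraction 1/4 sign.
    static String keepLettersAndAnyNumber(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    // Keeps letters and decimal digits only.
    static String keepLettersAndDecimalDigits(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    public static void main(String[] args) {
        String subject = "\u00BC cup, 2 eggs"; // "¼ cup, 2 eggs"

        // \p{N} lets the fraction sign through.
        System.out.println(keepLettersAndAnyNumber(subject));

        // \p{Nd} removes it and keeps the plain digit 2.
        System.out.println(keepLettersAndDecimalDigits(subject));
    }
}
```

Note that \p{Nd} is still broader than [0-9]: it also covers decimal digits from other scripts, which may or may not be what you want.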

Answered by Tim Pietzcker, Oct 29 '22 18:10