
String split, words including accented characters

Tags: java, regex

I'm using this regex:

x.split("[^a-zA-Z0-9']+");

This returns an array of strings with letters and/or numbers.

If I use this:

String name = "CEN01_Automated_TestCase.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I get:

CEN01
Automated
TestCase
java

But if I use this:

String name = "CEN01_Automação_Caso_Teste.java";
String[] names = name.split("[^a-zA-Z0-9']+");

I get:

CEN01
Automa
o
Caso
Teste
java

How can I modify this regex so that accented characters (á, ã, õ, etc.) are kept as part of a word?

Jvam asked Mar 06 '13 19:03



2 Answers

From http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.

Since the Character class has an isAlphabetic method, you can use

name.split("[^\\p{IsAlphabetic}0-9']+");

You can also use

name.split("(?U)[^\\p{Alpha}0-9']+");

but this form requires the UNICODE_CHARACTER_CLASS flag, which the embedded (?U) at the start of the regex enables.
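To see both variants side by side, here is a minimal sketch run against the file name from the question. The class name AccentedSplit and the two helper methods are made up for illustration. The accented letters are written as Unicode escapes only so that the result does not depend on the encoding of the source file.

```java
import java.util.Arrays;

public class AccentedSplit {
    // Splits on runs of characters that are neither Unicode letters,
    // ASCII digits, nor apostrophes.
    static String[] splitWords(String input) {
        return input.split("[^\\p{IsAlphabetic}0-9']+");
    }

    // Same idea with the POSIX class Alpha, which becomes Unicode-aware
    // once the embedded (?U) flag is present.
    static String[] splitWordsUnicodeFlag(String input) {
        return input.split("(?U)[^\\p{Alpha}0-9']+");
    }

    public static void main(String[] args) {
        // Same text as the question's second example, with the two
        // accented letters given as escapes.
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitWords(name)));
        System.out.println(Arrays.toString(splitWordsUnicodeFlag(name)));
    }
}
```

Both calls should return the five elements CEN01, Automação, Caso, Teste and java, with "Automação" kept in one piece.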

Pshemo answered Oct 18 '22 17:10


I would check out the Java documentation on regular expressions. It has a Unicode section, which I believe is what you are looking for.

EDIT: Example

Another way would be to match on the code of the character you are looking for, written as \uXXXX, where XXXX is the four-digit hexadecimal code of the character you are trying to match.

Example: \u00E0 matches à

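Note that in a Java string literal the backslash has to be doubled ("\\u00E0") if the regex engine is to see the escape itself. With a single backslash ("\u00E0") the compiler converts the escape into the character à before the regex is compiled, and the resulting pattern matches the same character.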
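Applied to the regex from the question, the character-code approach could look like the sketch below. The class and method names are hypothetical, and the range \u00C0-\u00FF is only an assumption about which accented letters are needed. It covers the Latin-1 accented letters used in Portuguese, but it also lets the × and ÷ signs through and misses letters outside that block (for example ő or ł).

```java
import java.util.Arrays;

public class CodePointSplit {
    // Keeps ASCII letters, digits, apostrophes and the Latin-1 range
    // from U+00C0 to U+00FF together; everything else is a delimiter.
    // The doubled backslashes hand the escapes to the regex engine.
    static String[] splitLatin1Words(String input) {
        return input.split("[^a-zA-Z0-9'\\u00C0-\\u00FF]+");
    }

    public static void main(String[] args) {
        // The accented letters are given as escapes so the sample does
        // not depend on the encoding of the source file.
        String name = "CEN01_Automa\u00E7\u00E3o_Caso_Teste.java";
        System.out.println(Arrays.toString(splitLatin1Words(name)));
    }
}
```

For the question's input this should give the same five words as a Unicode property class would. The difference is that the set of accepted letters is spelled out by hand.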

Andrew Backes answered Oct 18 '22 16:10