
Regex diacritics

Tags: java, regex

I have the following regex:

String regExpression = "^[a-zA-Z0-9+,. '-]{1,"+maxCharacters+"}$";

which works fine for me, except that it doesn't allow any Unicode letters with diacritics (Ă ă Â â Î î Ș ș Ț ț).

I only need my current regex to accept diacritics in addition to what it already accepts.

Any help is appreciated. Thanks.

Fofole asked Apr 17 '12 09:04


1 Answer

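You need a character class that covers those letters. Java's regex engine does support POSIX character classes, but by default they are US-ASCII only and not language specific. That means \p{Graph} (a visible character: [\p{Alnum}\p{Punct}]) or \p{Print} (a printable character: [\p{Graph}\x20]) will not match diacritics unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS (Java 7 and later). They would also accept far more punctuation than your current class does.

The best fit, as suggested by Sorin, is probably \p{L} (any Unicode letter).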
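To illustrate how the POSIX classes behave in Java, here is a minimal sketch. The class name PosixClassDemo and its two helper methods are made up for this example. It relies on the fact that \p{Alpha} covers only ASCII letters by default and follows the Unicode Alphabetic property when Pattern.UNICODE_CHARACTER_CLASS (Java 7 and later) is set.

```java
import java.util.regex.Pattern;

public class PosixClassDemo {

    // Default behaviour: \p{Alpha} covers only the ASCII letters a-z and A-Z.
    static boolean asciiAlpha(String s) {
        return Pattern.compile("\\p{Alpha}+").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \p{Alpha} follows the Unicode Alphabetic property.
    static boolean unicodeAlpha(String s) {
        return Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        // Romanian "si" with a comma-below s, written as an escape
        // so the result does not depend on the source file encoding.
        String word = "\u0219i";
        System.out.println(asciiAlpha(word));   // ASCII-only class rejects it
        System.out.println(unicodeAlpha(word)); // Unicode-aware class accepts it
    }
}
```

Even with the flag set, a class such as \p{Graph} is much broader than the original whitelist, so it is probably not what the question needs.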

import java.util.regex.Pattern;

public class Regexer {

    public static void main(String[] args) {
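        int maxCharacters = 100;
        // Sample input containing the Romanian diacritics from the question.
        String data = "Ă ă Â â Î î Ș ș Ț ț";
        // \p{L} matches any Unicode letter; the rest of the class is unchanged.
        String pattern = "^[\\p{L}0-9+,. '-]{1," + maxCharacters + "}$";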

        Pattern p = Pattern.compile(pattern);

        if (p.matcher(data).matches()) {
            System.out.println("Hit");
        } else {
            System.out.println("No");
        }

    }
}

This works for me.
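One caveat: \p{L} matches precomposed letters such as Ă (U+0102), but not a base letter followed by a separate combining mark (A + U+0306), which some input sources produce. A possible way to handle that, sketched here under the assumption that normalizing the input is acceptable, is to convert the text to NFC with java.text.Normalizer before matching. The class DiacriticsCheck and its method names are hypothetical, not part of the original answer.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class DiacriticsCheck {

    // Same character class as in the answer: Unicode letters plus the original whitelist.
    static Pattern buildPattern(int maxCharacters) {
        return Pattern.compile("^[\\p{L}0-9+,. '-]{1," + maxCharacters + "}$");
    }

    // Matches the input exactly as given, without normalization.
    static boolean matchesRaw(String input, int maxCharacters) {
        return buildPattern(maxCharacters).matcher(input).matches();
    }

    // Normalizes to NFC first, so a base letter plus combining mark
    // becomes a single precomposed letter where one exists.
    static boolean isValid(String input, int maxCharacters) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFC);
        return buildPattern(maxCharacters).matcher(normalized).matches();
    }

    public static void main(String[] args) {
        String decomposed = "A\u0306"; // A followed by a combining breve
        System.out.println(matchesRaw(decomposed, 100)); // false: the combining mark is not a letter
        System.out.println(isValid(decomposed, 100));    // true: NFC turns it into U+0102
    }
}
```

Normalizing first also keeps the {1,max} length limit counting one unit per accented letter instead of two. An alternative would be to add \p{M} to the character class, at the cost of also accepting stray combining marks.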

Hauke Ingmar Schmidt answered Nov 15 '22 05:11