
How to properly write regex for unicode first name in Java?

I need to write a regular expression to replace the invalid characters in a user's input before passing it on. I think I need to use string.replaceAll("regex", "replacement") for that. The line of code should replace every character that is not a Unicode letter, so it is effectively a whitelist of Unicode characters. In short, it validates a user's first name and replaces the invalid characters.

What I've found so far is \p{L}\p{M}, but I'm not sure how to use it in a regex so that it works as described above. Is this a case for regex negation?

Rihards asked Jun 27 '11 13:06


2 Answers

I don't believe that Java’s default regex library (read: outside of linking to ICU’s, which I would suggest doing even though it requires JNI) supports the Unicode properties you need for this.

If it did, you would include \p{Diacritic} in your pattern. But you need full property support for that.

I suppose that you could shoot for (\pL\pM*)+ but that fails for various diacritics: What if someone’s first name is not just Étoile but L’étoile?
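
A minimal sketch of that failure case, assuming java.util.regex accepts the \pL and \pM general-category classes (as the second answer below uses them). The class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class NamePatternDemo {
    // One or more letters, each optionally followed by combining marks.
    private static final Pattern LETTERS_AND_MARKS = Pattern.compile("(\\pL\\pM*)+");

    // Hypothetical helper: true only if the whole string matches the pattern.
    static boolean looksLikeName(String s) {
        return LETTERS_AND_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "Étoile": only letters, so the pattern accepts it.
        System.out.println(looksLikeName("\u00C9toile"));
        // "L’étoile": U+2019 is punctuation, not a letter or mark, so it is rejected.
        System.out.println(looksLikeName("L\u2019\u00E9toile"));
    }
}
```

The apostrophe is what breaks the match, which is why a letters-and-marks whitelist is too strict for real names.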

Also, I thought that the problem of validating people’s names was considered virtually unsolvable, and so you should just let people use whatever they like, possibly cleaned up per RFC 3454’s “stringprep” algorithm.

tchrist answered Nov 15 '22 04:11


Yes, you need negation. The regular expression would be [^\p{L}] for anything except letters. Another way to write this would be \P{L}.

\p{M} means "all marks", thus [^\p{L}\p{M}] means anything which is neither a letter nor a mark. This could also be written as [\P{L}&&[\P{M}]], but that is not really better.

In a Java string literal every \ has to be doubled, so you would write string.replaceAll("[^\\p{L}\\p{M}]", "replacement") there.
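
As a rough sketch of how that call might be wrapped, assuming the replacement is the empty string (the question does not say what it should be) and using a hypothetical class and method name:

```java
class FirstNameSanitizer {
    // Removes everything that is neither a Unicode letter nor a Unicode mark.
    // Assumption: invalid characters are simply dropped (replacement = "").
    static String stripNonLetters(String input) {
        return input.replaceAll("[^\\p{L}\\p{M}]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonLetters("J0hn!"));        // digit and '!' are removed
        System.out.println(stripNonLetters("Zo\u00EB 42"));  // space and digits are removed
        System.out.println(stripNonLetters("Anne-Marie"));   // the hyphen is removed too
    }
}
```

Note that spaces, hyphens and apostrophes are removed as well, which may or may not be what you want for names.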


From a comment:

By the way, regarding your answer, what falls into the marks category? Do I even need that? Wouldn't just letters be fine for a first name?

This category consists of the following subcategories:

  • Mn: Mark, Non-Spacing

    An example for this is ̀, U+0300. This is the COMBINING GRAVE ACCENT, and can be used together with a letter (the letter before) to create accented characters. For the commonly used accented characters there is already a precomposed form (e.g. é), but for other ones there is not.

  • Mc: Mark, Spacing Combining.

    These are quite rare ... I found them mainly in South Asian scripts and in musical notation. For example, U+1D165, MUSICAL SYMBOL COMBINING STEM, can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE. (My browser does not render these characters correctly; have a look at the code charts for the glyphs.)

  • Me: Mark, Enclosing

    These are marks which somehow enclose the base letter (the previous one, if I understand right). One example would be U+20DD, ⃝, which allows creating things like A⃝. (This should be rendered as an A enclosed by a circle, if I understand right. It does not, in my browser.) Another one would be U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should give the look of a key cap with the letter on it (A⃣). (They do not show in my browser. Have a look at the code chart, if you can't see them.)

You can find them all by searching UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. Some more information is in the FAQ: Characters and Combining Marks.
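
If you would rather ask the JVM than search the data file, something like the following should work. It is only a sketch: the class and method names are made up, and the result reflects the Unicode version your Java runtime ships with.

```java
class MarkCategories {
    // Maps a code point to its mark subcategory, using java.lang.Character's
    // general-category constants. Returns "not a mark" for everything else.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.ENCLOSING_MARK:         return "Me";
            default:                               return "not a mark";
        }
    }

    public static void main(String[] args) {
        System.out.println(markCategory(0x0300));  // COMBINING GRAVE ACCENT
        System.out.println(markCategory(0x1D165)); // MUSICAL SYMBOL COMBINING STEM
        System.out.println(markCategory(0x20DD));  // COMBINING ENCLOSING CIRCLE
        System.out.println(markCategory('e'));     // an ordinary letter
    }
}
```

The int overload of Character.getType is used so that code points outside the BMP, such as U+1D165, can be looked up too.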

Do you need them? I'm not sure. Most common names (at least in Latin alphabets) would use precomposed letters, I think. But the user might input them in decomposed form; I think this is actually the default on Mac OS X. If you only allow letters, you would have to run the normalization algorithm before filtering out unknown characters. (Running the normalization seems a good idea anyway if you want to compare the names and not only show them on screen.)
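
A small sketch of that idea, assuming NFC is the normalization form you want and that only letters (no marks) are kept afterwards. java.text.Normalizer is part of the standard library; the class and method names here are hypothetical:

```java
import java.text.Normalizer;

class NormalizeThenFilter {
    // Composes base letter + combining mark into a precomposed letter where one
    // exists (NFC), then removes everything that is not a letter.
    static String clean(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.replaceAll("\\P{L}", "");
    }

    public static void main(String[] args) {
        String decomposed = "Jose\u0301"; // "e" followed by COMBINING ACUTE ACCENT

        // Without normalization the accent is a separate mark and gets stripped.
        System.out.println(decomposed.replaceAll("\\P{L}", ""));
        // With normalization the accent survives as part of the precomposed letter.
        System.out.println(clean(decomposed));
    }
}
```

Marks that have no precomposed form would still be lost this way, so keeping \p{M} in the whitelist remains the safer choice.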


Edit: not directly relating to the question, but relating to the discussion in the comments:

I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:

package de.fencing_game.paul.examples;

import java.util.regex.*;

public class RegexSample {

    static String[] regexps = {
        "[^\\pL\\pM]", "[\\PL\\PM]",
        ".", "\\pL", "\\pM",
        "\\PL", "\\PM"
    };

    static String[] strings = {
        "x", "A", "3", "\n", ".", "\t", "\r", "\f",
        " ", "-", "!", "»", "›", "‹", "«",
        "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
        "༬", "༺", "༼", "ང", "⃓", "✄",
        "⟪", "や", "゙",
        "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
        "⑪", "⒄", "⒰", "ⓛ", "⓶",
        "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
        "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
        "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
        "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
    };


    public static void main(String[] params) {
        Pattern[] patterns = new Pattern[regexps.length];

        System.out.print("       ");
        for(int i = 0; i < regexps.length; i++) {
            patterns[i] = Pattern.compile(regexps[i]);
            System.out.print("| " + patterns[i] + " ");
        }
        System.out.println();
        System.out.print("-------");
        for(int i = 0; i < regexps.length; i++) {
            System.out.print("|-" +
                             "--------------".substring(0,
                                                        regexps[i].length()) +
                             "-");
        }
        System.out.println();

        for(int j = 0; j < strings.length; j++) {
            System.out.printf("U+%04x ", (int)strings[j].charAt(0));
            for(int i = 0; i < regexps.length; i++) {
                boolean match = patterns[i].matcher(strings[j]).matches();
                System.out.print("| " + (match ? "✔" : "-")  +
                                 "         ".substring(0, regexps[i].length()));
            }
            System.out.println();
        }
    }
}

Here is the output (with OpenJDK 1.6.0_20 on OpenSUSE):

       | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM 
-------|-----------|----------|---|-----|-----|-----|-----
U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔   
U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔   
U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   
U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -   
U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔   

We can see that:

  1. [^\pL\pM] is not equivalent to [\PL\PM]
  2. [\PL\PM] really matches everything, but
  3. still [\PL\PM] is not equal to ., since . does not match \n and \r.

The second point is caused by the fact that [\PL\PM] is the union of \PL and \PM: \PL contains characters from all categories other than L (including M), and \PM contains characters from all categories other than M (including L) - together they contain the whole character repertoire.

[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
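
That last claim can be checked mechanically. The following is a sketch only (hypothetical class and method names) that compares two character classes over every code point except the surrogates, using the intersection syntax [\P{L}&&[\P{M}]] mentioned earlier:

```java
import java.util.regex.Pattern;

class ClassAlgebraCheck {
    // Counts the code points on which the two single-character regexes disagree.
    static int countDisagreements(String regexA, String regexB) {
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        int disagreements = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Skip surrogate code points; they are not characters on their own.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                disagreements++;
            }
        }
        return disagreements;
    }

    public static void main(String[] args) {
        // Complement of the union vs. intersection of the complements: should agree everywhere.
        System.out.println(countDisagreements("[^\\pL\\pM]", "[\\P{L}&&[\\P{M}]]"));
        // Complement of the union vs. union of the complements: differs on every letter and mark.
        System.out.println(countDisagreements("[^\\pL\\pM]", "[\\PL\\PM]"));
    }
}
```

The first count should be zero; the second depends on how many letters and marks the runtime's Unicode version defines.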

Paŭlo Ebermann answered Nov 15 '22 03:11