 

Match only unicode letters

I have the following regex that allows only letters of the English alphabet:

     /[a-zA-Z]+/

     var a = "abcDF";
     // match() returns an array; == coerces it to the matched substring
     if (a.match(/[a-zA-Z]+/) == a) {
        // Match
     } else {
        // No Match
     }

How can I do this using \p{L} (universal: letters of any language, such as German, English, etc.)?

What I tried :

  a.match(/[p{l}]+/)
  a.match(/[\p{l}]+/)
  a.match(/p{l}/)
  a.match(/\p{l}/)

but all of them returned null for a = "aB".

user1767962 asked Nov 03 '12 14:11




3 Answers

Starting with ECMAScript 2018, JavaScript finally supports Unicode property escapes natively.

For older versions, you either need to define all the relevant Unicode ranges yourself, or you can use Steven Levithan's XRegExp package with its Unicode add-ons and rely on its Unicode property shortcuts:

var regex = new XRegExp("^\\p{L}*$")
var a = "abcäöüéèê"
if (regex.test(a)) {
    // Match
} else {
    // No Match
}
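As a rough illustration of the "define the ranges yourself" route, here is a minimal sketch that covers only ASCII letters plus the main letter ranges of the Latin-1 Supplement block. The names LATIN1_LETTERS and isLatin1Word are made up for this example, and the class is deliberately incomplete: a real replacement for \p{L} would need far more ranges.

```javascript
// Hypothetical, hand-maintained character class (NOT equivalent to \p{L}):
// ASCII letters plus U+00C0-U+00FF, skipping U+00D7 (×) and U+00F7 (÷),
// which are symbols rather than letters.
var LATIN1_LETTERS = /^[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+$/;

function isLatin1Word(s) {
  return LATIN1_LETTERS.test(s);
}

console.log(isLatin1Word("abc\u00E4\u00F6\u00FC")); // true  ("abcäöü")
console.log(isLatin1Word("abc123"));                // false (digits)
console.log(isLatin1Word("\u03B1\u03B2\u03B3"));    // false (Greek is not listed)
```

This works without any library or flag, but every script you want to support has to be added by hand, which is why a library or native property escapes are usually preferable.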
Tim Pietzcker answered Oct 21 '22 08:10


If you are willing to use Babel to build your JavaScript, there is a Babel plugin I have released that transforms regular expressions like /^\p{L}+$/ or /\p{^White_Space}/ into regular expressions that browsers understand.

This is the project page: https://github.com/danielberndt/babel-plugin-utf-8-regex

Daniel answered Oct 21 '22 08:10


You can use \p{L} in modern, ECMAScript 2018+ compliant JavaScript environments, but remember that Unicode property escapes are only recognized when you pass the u flag (the newer v flag, where available, enables them as well):

a.match(/\p{L}+/gu)
a.match(/\p{Alphabetic}+/gu)

Both match all occurrences of one or more consecutive Unicode letters in the string a.
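Because the question is about checking that a whole string consists of letters, the pattern can be anchored with ^ and $. The following is a small sketch under that assumption; isAllLetters is a hypothetical helper name, not something from the question. It also shows what happens when the u flag is left out, which is likely why the attempts in the question returned null.

```javascript
// Hypothetical helper: true only if every character of s is a Unicode letter.
function isAllLetters(s) {
  // ^ and $ force the whole string to match; the u flag is what turns
  // \p{L} into a Unicode property escape.
  return /^\p{L}+$/u.test(s);
}

console.log(isAllLetters("aB"));     // true
console.log(isAllLetters("Straße")); // true
console.log(isAllLetters("abc123")); // false

// Unanchored with the g flag: every run of letters is returned.
console.log("ab1cd".match(/\p{L}+/gu)); // two matches: "ab" and "cd"

// Without the u flag, \p is just a literal "p" and {L} is literal text.
console.log(/\p{L}/.test("aB"));   // false
console.log(/\p{L}/.test("p{L}")); // true
```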

NOTE that \p{Alphabetic} (\p{Alpha}) includes all letters matched by \p{L}, plus letter numbers matched by \p{Nl} (e.g. Ⅻ, the character for the Roman numeral 12), plus some other symbols that have the Other_Alphabetic (OAlpha) property.
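The difference is easy to see with a single letter-number character. This short sketch uses U+216B (ROMAN NUMERAL TWELVE) as the test input; the variable name romanTwelve is just for illustration.

```javascript
var romanTwelve = "\u216B"; // Ⅻ, general category Nl (letter number)

console.log(/^\p{L}$/u.test(romanTwelve));          // false: not a letter
console.log(/^\p{Nl}$/u.test(romanTwelve));         // true
console.log(/^\p{Alphabetic}$/u.test(romanTwelve)); // true
console.log(/^\p{Alpha}$/u.test(romanTwelve));      // true: Alpha is an alias
```

So if "letters only" should exclude characters like Ⅻ, \p{L} is the stricter choice.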

There are some things to bear in mind when using the u flag with a regex:

  • You can use Unicode code point escape sequences such as \u{1F42A} to specify characters by code point. Normal Unicode escapes such as \u03B1 are limited to four hexadecimal digits, which covers only the Basic Multilingual Plane (source)
  • "Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters" (source)
  • Escaping rules for patterns compiled with the u flag are stricter: you can't escape arbitrary characters, only those that can actually behave as special characters. See HTML input pattern not working.
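The three points above can be checked with a few one-liners. This is only a sketch: the variable camel and the helper compiles are names invented for this example, and \- is just one instance of an escape that u mode rejects outside a character class.

```javascript
var camel = "\u{1F42A}"; // U+1F42A lies outside the BMP: two UTF-16 code units

console.log(camel.length);              // 2
console.log(/^.$/.test(camel));         // false: . sees one code unit at a time
console.log(/^.$/u.test(camel));        // true:  . sees one whole code point
console.log(/^\u{1F42A}$/u.test(camel)); // true:  code point escape needs u

// Hypothetical helper: does this pattern source compile with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false; // a SyntaxError is thrown for invalid patterns
  }
}

console.log(compiles("\\-", ""));  // true:  needless escape is tolerated
console.log(compiles("\\-", "u")); // false: "-" is not special here, so \- is rejected
```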
Wiktor Stribiżew answered Oct 21 '22 09:10