
Can regular expressions work with different languages?

English, of course, is a no-brainer for regex because that's what it was originally developed in/for:

Can regular expressions understand this character set?

French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?

Les expressions régulières peuvent comprendre ce jeu de caractères?

Japanese doesn't contain what I know as regex word characters to match against.

正規表現は、この文字を理解でき、設定?

John K Avatar asked Mar 03 '10 13:03



4 Answers

"[\p{L}]" This regular expression matches any single character that is a letter, in any language, upper or lower case. So letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted, but signs like (, . ? > :) and similar ones are not.

  • The brackets [] mean that this expression is a set.
  • If you want to accept any number of letters from this set, put an asterisk * after the brackets, like this: "[\p{L}]*"
  • Always account for white space in your regex, since a match can fail because of it. To allow spaces you can use: "[\p{L} ]*" (notice the space inside the brackets)
  • If you want to include numbers as well, "[\p{L}\p{N} ]*" can help. \p{N} matches any kind of numeric character in any script.
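
As a rough illustration of the letter-set idea in Python: the built-in `re` module does not accept `\p{L}`, so this sketch approximates it with `[^\W\d_]` (a word character that is neither a decimal digit nor an underscore). That substitution is an assumption made for the example and is not an exact equivalent, since it can also admit a few non-decimal numeric characters. The helper name `is_letters_and_spaces` is made up here.

```python
import re

# Approximation of "[\p{L} ]*" for Python's built-in re module, which has
# no \p{L}. [^\W\d_] reads as "word character, minus decimal digits and underscore".
LETTERS_AND_SPACES = re.compile(r"(?:[^\W\d_]| )*")


def is_letters_and_spaces(text: str) -> bool:
    """Return True if text contains only letters (any script) and spaces."""
    return LETTERS_AND_SPACES.fullmatch(text) is not None


print(is_letters_and_spaces("Les expressions régulières"))  # True
print(is_letters_and_spaces("正規表現"))                      # True
print(is_letters_and_spaces("what?"))                        # False
```

Engines that support Unicode properties directly can use the `\p{L}` form as written in the answer.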
div-ane Avatar answered Nov 19 '22 06:11



Short answer: yes.

More specifically, it depends on whether your regex engine supports Unicode matches (as described here).

Such matches can complicate your regular expressions enormously, so I recommend reading this Unicode regex tutorial. Also note that Unicode implementations themselves can be quite a mess, so you might benefit from reading Joel Spolsky's article about the inner workings of character sets.
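
To make the engine dependence concrete, here is a small sketch using Python 3's built-in `re` module, chosen only as one example engine. For `str` patterns its `\w` is Unicode-aware by default, and the `re.ASCII` flag switches it back to ASCII-only behaviour, which is roughly what an engine without Unicode support would do.

```python
import re

text = "Les expressions régulières"

# Default for str patterns in Python 3: \w is Unicode-aware,
# so accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['Les', 'expressions', 'régulières']
print(ascii_words)    # ['Les', 'expressions', 'r', 'guli', 'res']
```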

Lars Tackmann Avatar answered Nov 19 '22 04:11



As far as I know, there isn't a specific pattern like [a-zA-Z] that also matches "è", but you can always list such characters explicitly, e.g. [a-zA-Zè正]

Obviously that can make your regexp immense, but you can keep it manageable by storing the character lists in variables and building the expression from those variables.
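
A minimal Python sketch of that variable-based approach might look like this. The variable name `extra_letters` and its contents are placeholders for whatever characters you need to accept.

```python
import re

# Hypothetical variable holding the extra letters to accept beyond a-z and A-Z.
extra_letters = "è正"

# re.escape guards against characters that would be special inside a character class.
pattern = re.compile(f"[a-zA-Z{re.escape(extra_letters)}]+")

print(pattern.fullmatch("très") is not None)  # True
print(pattern.fullmatch("été") is not None)   # False: é was never listed
```

The second call shows the cost of this approach: every character you forget to list is rejected.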

Marcos Placona Avatar answered Nov 19 '22 06:11



Generally speaking, regex is more for grokking machine-readable text than for human-readable text. This is in many ways a more general version of the whole parsing-XML-with-regex problem: regex is by its very nature incapable of properly parsing human language, because the language is more complex than the tool you are using to parse it.

If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.

Williham Totland Avatar answered Nov 19 '22 06:11
