
Why does \w match only English letters in JavaScript regex?

I'm trying to find URLs in some text using JavaScript. The regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).

So what can I use instead of \w to match all letters in all languages?

Doron Yaacoby asked Dec 29 '08 14:12



3 Answers

Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-ASCII letters (for example, ö or ñ) are outside those ranges.
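A quick way to see this for yourself is the small check below. It is only a sketch: the sample strings are made up for illustration, and the Hebrew word is written with \u escapes so the source stays ASCII.

```javascript
// \w is equivalent to [A-Za-z0-9_], so only ASCII word characters match.
const asciiWord = /^\w+$/;

const results = {
  english: asciiWord.test("hello_123"),              // ASCII letters, digits, underscore
  hebrew: asciiWord.test("\u05E9\u05DC\u05D5\u05DD"), // Hebrew letters (U+05D0..U+05EA range)
  umlaut: asciiWord.test("\u00F6"),                  // o with umlaut, outside ASCII
};

console.log(results); // { english: true, hebrew: false, umlaut: false }
```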

Instead of matching every non-English letter (there are many of them, spread across many different Unicode ranges), you might be better off looking for the characters that delimit your words: spaces, quotation marks, and other punctuation.
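One way to sketch that delimiter-based idea is shown below. The helper name findUrls, the http/https prefix, and the chosen delimiter set (whitespace, quotes, angle brackets) are all assumptions for illustration, not something from the question.

```javascript
// Hypothetical helper: a URL runs from "http://" or "https://" up to the
// next delimiter (whitespace, a quote, or an angle bracket), whatever
// script the characters in between belong to.
function findUrls(text) {
  const urlPattern = /https?:\/\/[^\s"'<>]+/g;
  return text.match(urlPattern) || [];
}

const sample =
  'See http://example.com/\u05E9\u05DC\u05D5\u05DD and "http://example.org/a" for details';
const urls = findUrls(sample);

console.log(urls.length); // 2
```

Because the pattern never names the allowed letters, Hebrew (or any other script) inside the URL is kept. The trade-off is that trailing punctuation such as a sentence-ending period would also be captured and may need trimming afterwards.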

David Koelle answered Sep 20 '22 19:09


The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
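The asymmetry between \s and the other two shorthands can be checked with a short sketch like this one. The sample characters are my own picks: a no-break space, an Arabic-Indic digit, and the Hebrew letter alef.

```javascript
const noBreakSpace = "\u00A0"; // NO-BREAK SPACE, not an ASCII character
const arabicDigit = "\u0663";  // ARABIC-INDIC DIGIT THREE
const hebrewAlef = "\u05D0";   // HEBREW LETTER ALEF

const shorthandChecks = {
  whitespace: /\s/.test(noBreakSpace), // \s covers Unicode whitespace
  digit: /\d/.test(arabicDigit),       // \d is [0-9] only
  word: /\w/.test(hebrewAlef),         // \w is [a-zA-Z0-9_] only
};

console.log(shorthandChecks); // { whitespace: true, digit: false, word: false }
```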

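JavaScript as defined by that standard does not support the \p syntax for matching Unicode properties either, so there isn't a good way to do this there. (ECMAScript 2018 later added \p{...} escapes, available when the regex uses the u flag.) You could match all Hebrew characters with: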

[\u0590-\u05FF]

This simply matches any code point in the Hebrew block.
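As a small sketch of what "any code point in the Hebrew block" means in practice, note that the block holds more than letters: the maqaf (U+05BE), a punctuation mark, matches too. The sample characters are chosen only for illustration.

```javascript
const hebrewBlock = /[\u0590-\u05FF]/;

const blockChecks = {
  alef: hebrewBlock.test("\u05D0"),  // HEBREW LETTER ALEF
  maqaf: hebrewBlock.test("\u05BE"), // HEBREW PUNCTUATION MAQAF, also in the block
  latinA: hebrewBlock.test("a"),     // ASCII letter, outside the block
};

console.log(blockChecks); // { alef: true, maqaf: true, latinA: false }
```

If only the letters are wanted, the narrower range \u05D0-\u05EA covers alef through tav, including the final forms.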

You can match any ASCII word character or any Hebrew character with:

[\w\u0590-\u05FF]
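
A minimal sketch of the combined class in use follows. The sample text is made up. The second pattern is an alternative that only works in engines implementing ECMAScript 2018 or later, where \p{L} and \p{N} are accepted under the u flag; treat it as an assumption about your runtime.

```javascript
// ASCII word characters plus anything in the Hebrew block.
const wordOrHebrew = /[\w\u0590-\u05FF]+/g;

// "abc", a Hebrew word followed by "_1", and a lone o-with-umlaut.
const sampleText = "abc \u05E9\u05DC\u05D5\u05DD_1 \u00F6";
const tokens = sampleText.match(wordOrHebrew);

console.log(tokens.length); // 2: the o-with-umlaut is in neither range

// ES2018+ only: letters and numbers from any script, plus underscore.
const anyScriptWord = /^[\p{L}\p{N}_]+$/u;
const umlautMatches = anyScriptWord.test("\u00F6");

console.log(umlautMatches); // true
```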

Jan Goyvaerts answered Sep 19 '22 19:09


I think you are looking for this regex:

^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-Z0-9\s\.\-_\\\/]+$
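
A sketch of how an allow-list like this could be tested is given below. Two things in it are my own choices: the Hebrew letters are written as the range \u05D0-\u05EA (alef through tav, final forms included) instead of being listed one by one, and the Latin range is A-Z, since A-z would also admit [, \, ], ^, _ and the backtick. Note that the ^ and $ anchors make this validate a whole string; it will not find a URL embedded in longer text.

```javascript
// Hebrew letters, ASCII letters and digits, whitespace, and . - _ \ /
const allowList = /^[\u05D0-\u05EAa-zA-Z0-9\s.\-_\\\/]+$/;

const allowChecks = {
  mixedPath: allowList.test("\u05E9\u05DC\u05D5\u05DD/abc-1.txt"),
  caret: allowList.test("a^b"),          // ^ is not on the list
  looseRange: /^[a-zA-z]+$/.test("a^b"), // A-z spans the ASCII symbols between Z and a
};

console.log(allowChecks); // { mixedPath: true, caret: false, looseRange: true }
```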

lani answered Sep 19 '22 19:09