
split and replace unicode words in javascript with regex

I need to wrap each word from a list of Unicode words in {} wherever it appears in a Unicode string. Here is my code:

var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|tw|two two|two|twöu|three|föur)(?=\\W|$)", "gi");
alert(txt.replace(re, '$1{$2}'));

It returns:

¿{One};{one} {one}é {two two} {two two} {two} {tw}ö {tw}öu {three};;{tw}ä;{föur}?

but should be:

¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?

What am I doing wrong?

John asked Apr 06 '11 07:04


2 Answers

The Problem

What am I doing wrong?

Unfortunately, the answer is that you are doing nothing wrong. Javascript is.

The problem is that JavaScript does not support Unicode regular expressions as they are spelled out in The Unicode Standard.

There is, however, a rather nice library called XRegExp which has a Unicode plugin that helps a great deal. I recommend it, albeit with several notable caveats. You need to know what it can do, and what it cannot.


What It Does

  • Corrects various bugs and inconsistencies in JavaScript implementations, including in the split function.
  • Supports the BMP code points covered by the 6.1 release of the Unicode Character Database, from January 2012.
  • Correctly ignores case, space, hyphen-minuses, and underscores in Unicode property names, per The Standard — something which even Java gets wrong.
  • Supports the Unicode General Categories like \p{L} for letters and \p{Sc} for currency symbols.
  • Supports the standard full property names like \p{Letter} for \p{L} and \p{Currency_Symbol} for \p{Sc}.
  • Supports the Unicode Script properties, like \p{Latin}, \p{Greek}, and \p{Common}.
  • Supports the Unicode Block properties, like \p{InBasic_Latin} and \p{InMathematical_Alphanumeric_Symbols}.
  • Supports the other 9 Unicode properties needed for level-1 compliance: \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{Any}, \p{ASCII}, and \p{Assigned}.
  • Supports named captures instead of just numbered ones, using standard notation to do so: (?<NAME>⋯) to declare a named group, \k<NAME> to backref it by name, and use ${NAME} in the replacement pattern (and in general access it using result.NAME in your code). This is the same syntax used by Perl 5.10, Java 7, .ɴᴇᴛ, and several other languages. It makes writing complex regexes a lot easier by letting you name parts instead of just numbering them, so that when you move stuff around you don’t have to recalculate the numbered variables.
  • Supports /s ᴀᴋᴀ (?s) mode so that dot matches any single code point, rather than anything except for a linebreak sequence. Most other regex engines support this mode.
  • Supports /x ᴀᴋᴀ (?x) mode so that whitespace and comments are ignored (if unescaped). Most regex engines support this mode. It is absolutely indispensable for creating legible — and hence, maintainable — patterns.
  • Supports embedded comments even when not in /x mode using the standard (?#⋯) notation to do so (such as seen in Perl). This lets you put comments in individual regex pieces without going all the way to /x mode, which is often important in developing more complex patterns, by allowing you to build them up piece-wise.
  • Supports extensibility, so that you can add new token types if you want, such as \a to mean the ALERT character, or the POSIXish character classes.
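To make the named-capture item concrete, here is a minimal sketch. It assumes an engine with ES2018 named groups built in and does not use the XRegExp API; note that the native replacement syntax is $<NAME> rather than the ${NAME} form described above. The names isoDate, toDayFirst, and doubledWord are made up for this illustration.

```javascript
// Assumption: ES2018+ engine with built-in named groups; this is not the XRegExp API.
const isoDate = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;

// toDayFirst is a hypothetical helper name used only for this sketch.
function toDayFirst(text) {
  // Native replacement strings refer to named groups as $<NAME>.
  return text.replace(isoDate, "$<day>/$<month>/$<year>");
}

// Native engines expose named captures under match.groups.
const match = isoDate.exec("released 2012-01-31");
const year = match.groups.year;

// \k<NAME> is a backreference by name: this pattern finds a doubled word.
const doubledWord = /\b(?<word>\w+) \k<word>\b/;

console.log(toDayFirst("released 2012-01-31")); // released 31/01/2012
console.log(year);                              // 2012
console.log(doubledWord.test("two two"));       // true
```

Because the groups are named, reordering the pieces of the pattern does not force any renumbering in the replacement string.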

What It Doesn’t

You should be careful, however, for the things that it does not do:

  • Does not support full Unicode, but only code points from Plane 0. This is a forbidden restriction, as The Unicode Standard requires that there be no difference between astral and non-astral code points in a regular expression. Even Java doesn’t get this right until JDK7. (However, the v2.1.0 development version does support full Unicode.)
  • Does not support \X for grapheme clusters, or \R for linebreak sequences.
  • Does not support two-part properties, like \p{GC=Letter}, \p{Block=Phonetic_Extensions}, \p{Script=Greek}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.
  • It does not update the character class shortcuts to operate per the requirements of UTS#18. Standard JavaScript only allows \s to match the Unicode \p{White_Space} property; it does not allow \d to match \p{Nd} (although some old browsers will do that anyway!) nor \w to match [\p{Alphabetic}\pM\p{Nd}\p{Pc}], let alone providing Unicode-aware versions of \b and \B, all of which are part of the requirements for supporting Unicode Regular Expressions.
  • It does not support some commonly used properties. In practice, the one that is missing is \p{digit}, and perhaps also the rather useful \p{Dash}, \p{Math}, \p{Diacritic}, and \p{Quotation_Mark} properties.
  • Has no support for grapheme clusters such as using \X or even via (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). This is a really big deal.
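The baseline behaviour of the character class shortcuts is easy to check in plain JavaScript, with no library and no u flag. This is a small sketch; the object name checks and its property names are made up for the illustration.

```javascript
// Plain JavaScript, no library and no u flag: \w, \d and \b are ASCII-based.
const checks = {
  // \w is [A-Za-z0-9_], so an accented letter is not a word character.
  wordMatchesAccented: /\w/.test("\u00E9"),        // "é"
  // \d is [0-9], so ARABIC-INDIC DIGIT THREE is not a digit.
  digitMatchesArabicIndic: /\d/.test("\u0663"),
  // \s does cover Unicode white space such as NO-BREAK SPACE.
  spaceMatchesNoBreakSpace: /\s/.test("\u00A0"),
  // \b is defined via \w, so it sees a boundary between "e" and "é".
  boundaryInsideWord: /\bone\b/.test("one\u00E9"),
  // Without the u flag, the dot matches one UTF-16 code unit, not one code point.
  dotMatchesAstral: /^.$/.test("\u{1D49C}"),       // MATHEMATICAL SCRIPT CAPITAL A
};

console.log(checks);
```

Only the \s check comes out true for the reason one would hope; boundaryInsideWord is also true, which is exactly the false boundary that makes "oneé" match in the question.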

Workarounds

Here are a few workarounds to handle a few of the places where the library doesn’t follow The Unicode Standard:

  • For the missing \w, you can use [\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers, as they’re not \p{Nd}-type numbers which are the only ones that count as alphanumeric.
  • For the missing \W, you can therefore use the set-complement of the previous one, so [^\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers.
  • Since \b is really the same as (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), you could plug that \w definition into that sequence to create a Unicode-aware version of \b — provided that JavaScript supported all four directions of lookaround, which when last I checked, it did not. You have to have both positive and negative lookbehind, not just lookahead, to do this correctly. Javascript neglects to support those, at least as far as I can see.
  • Since \B is really the same as (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), you could do the same, but subject to the same conditions.
  • For the missing \X, you can get sorta close by using \P{M}\p{M}*, but that incorrectly splits up CRLF constructs and allows marks on the same, all of which is really quite wrong.
  • For the missing \R, you can construct a work-around using (?:\r\n|[\n-\r\u0085\u2028\u2029]).
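The \b and \R workarounds can be sketched as follows. This assumes an engine with ES2018 features (the u flag, \p{...} property escapes, and lookbehind), which is newer than what the text above describes, and it uses native syntax rather than XRegExp. Native property escapes have no block properties, so the enclosed-alphanumerics part of the \w substitute is left out. WORD_CHAR, wrapUnicodeWords, and splitLines are names invented for this example.

```javascript
// Assumption: ES2018+ engine (u flag, \p{...} escapes, lookbehind); not the XRegExp API.
// Approximate Unicode word character, minus the enclosed-alphanumerics block.
const WORD_CHAR = "[\\p{L}\\p{Nl}\\p{Nd}\\p{M}]";

// "two two" is listed before "two" so the longer phrase is tried first.
// No escaping is done: the sample words contain no regex metacharacters.
const words = ["two two", "twöu", "three", "föur", "one", "two", "tw"];

// Lookbehind and lookahead stand in for a Unicode-aware \b on each side.
const wordRe = new RegExp(
  `(?<!${WORD_CHAR})(?:${words.join("|")})(?!${WORD_CHAR})`,
  "giu"
);

// Hypothetical helper: wraps every listed word in braces.
function wrapUnicodeWords(text) {
  return text.replace(wordRe, "{$&}");
}

// The \R workaround needs no newer features: CRLF first, then single linebreaks.
const lineBreak = /\r\n|[\n-\r\u0085\u2028\u2029]/;

// Hypothetical helper: splits text on any linebreak sequence.
function splitLines(text) {
  return text.split(lineBreak);
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
console.log(wrapUnicodeWords(txt));
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
console.log(splitLines("a\r\nb\nc\u2028d"));
```

Because the lookarounds consume nothing, the delimiter before each word does not need to be captured and put back in the replacement.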

Summary

The conclusion is that JavaScript’s regexes are completely unsuited for Unicode work. However, the XRegExp plugin moves closer to making that feasible. If you can live with its restrictions, this is probably easier than switching to a different but Unicode-aware programming language. It’s certainly better than being unable to use Unicode regexes at all.

However, it is still a rather long way from meeting even the most basic requirements (Level 1 support) for Unicode regexes as spelled out in the standard. Someday you are going to want to be able to match characters whether they have accent marks on them or not, or which sit up in the Mathematical Alphanumeric Symbols block, or which use the Unicode case-mapping and case-folding definitions, or which follow The Unicode Standard for alphanumeric sorts or for line- and word-breaking, and you cannot do any of those things in JavaScript even with the plug-in.

So you might wish to consider using a language that is compliant with The Unicode Standard if you actually need to handle Unicode. Javascript just doesn’t manage that.

tchrist answered Sep 28 '22 08:09


Firstly, unless the regex is dynamic, please use the /.../gi notation.

It returns the wrong value because \W in JavaScript is really just [^0-9a-zA-Z_]. Accented characters like é are not considered word characters, so you need to exclude them manually.

var re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;
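As a quick check, the pattern above can be run against the string from the question. This is only a sketch: the character class lists just the three accented letters that occur in this sample, so any other accented letter would have to be added by hand, and wrapWords is a name made up for this example.

```javascript
// Hand-written "letter" class: ASCII letters plus the accented letters in the sample.
const re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;

// wrapWords is a hypothetical helper name, not part of the answer itself.
function wrapWords(text) {
  // $1 puts back the delimiter consumed before the word; $2 is the word.
  return text.replace(re, "$1{$2}");
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
console.log(wrapWords(txt));
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```

The alternative tw still comes before two two, but it no longer wins: the lookahead rejects it whenever a letter from the class follows, so the engine moves on to the longer alternatives.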
kennytm answered Sep 28 '22 06:09