I need to wrap each word from a list of Unicode words in {} wherever it occurs in a Unicode string. Here is my code:
var txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
var re = new RegExp("(^|\\W)(one|tw|two two|two|twöu|three|föur)(?=\\W|$)", "gi");
alert(txt.replace(re, '$1{$2}'));
It returns:
¿{One};{one} {one}é {two two} {two two} {two} {tw}ö {tw}öu {three};;{tw}ä;{föur}?
but should be:
¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
What am I doing wrong?
Unfortunately, the answer is that you are doing nothing wrong. Javascript is.
There is, however, a rather nice library called XRegExp which has a Unicode plugin that helps a great deal. I recommend it, albeit with several notable caveats. You need to know what it can do, and what it cannot.
- … split function.
- Unicode general categories, such as \p{L} for letters and \p{Sc} for currency symbols.
- The long names of those categories, such as \p{Letter} for \p{L} and \p{Currency_Symbol} for \p{Sc}.
- Unicode scripts, such as \p{Latin}, \p{Greek}, and \p{Common}.
- Unicode blocks, such as \p{InBasic_Latin} and \p{InMathematical_Alphanumeric_Symbols}.
- Other Unicode properties, such as \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{Any}, \p{ASCII}, and \p{Assigned}.
- Named captures: use (?<NAME>⋯) to declare a named group, \k<NAME> to backref it by name, and ${NAME} in the replacement pattern (and in general access it using result.NAME in your code). This is the same syntax used by Perl 5.10, Java 7, .ɴᴇᴛ, and several other languages. It makes writing complex regexes a lot easier by letting you name parts instead of just numbering them, so that when you move stuff around you don’t have to recalculate the numbered variables.
- /s ᴀᴋᴀ (?s) mode, so that dot matches any single code point rather than anything except a linebreak sequence. Most other regex engines support this mode.
- /x ᴀᴋᴀ (?x) mode, so that whitespace and comments are ignored (if unescaped). Most regex engines support this mode. It is absolutely indispensable for creating legible, and hence maintainable, patterns.
- Comments even outside /x mode, using the standard (?#⋯) notation to do so (such as seen in Perl). This lets you put comments in individual regex pieces without going all the way to /x mode, which is often important in developing more complex patterns, by allowing you to build them up piece-wise.
- … \a to mean the ALERT character, or the POSIXish character classes.

You should be careful, however, about the things that it does not do:
- \X for grapheme clusters, or \R for linebreak sequences.
- Compound property notation such as \p{GC=Letter}, \p{Block=Phonetic_Extensions}, \p{Script=Greek}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter}, and \p{Numeric_Value=10}.
- Letting \s match the Unicode \p{White_Space} property; nor does it allow \d to match \p{Nd} (although some old browsers will do that anyway!) or \w to match [\p{Alphabetic}\pM\p{Nd}\p{Pc}], let alone provide Unicode-aware versions of \b and \B, all of which are part of the requirements for supporting Unicode Regular Expressions.
- \p{digit}, and perhaps also the rather useful \p{Dash}, \p{Math}, \p{Diacritic}, and \p{Quotation_Mark} properties.
- Grapheme clusters, whether via \X or even via (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). This is a really big deal.
Here are a few workarounds to handle a few of the places where the library doesn’t follow The Unicode Standard:
- For \w, you can use [\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers, as they’re not \p{Nd}-type numbers, which are the only ones that count as alphanumeric.
- For \W, you can therefore use the set-complement of the previous one, so [^\p{L}\p{Nl}\p{Nd}\p{M}\p{InEnclosedAlphanumerics}]. It overstates matters only in the enclosed numbers.
- Since \b is really the same as (?:(?<=\w)(?!\w)|(?<!\w)(?=\w)), you could plug that \w definition into that sequence to create a Unicode-aware version of \b, provided that JavaScript supported all four directions of lookaround, which, when last I checked, it did not. You need both positive and negative lookbehind, not just lookahead, to do this correctly. JavaScript neglects to support those, at least as far as I can see.
- Since \B is really the same as (?:(?<=\w)(?=\w)|(?<!\w)(?!\w)), you could do the same, but subject to the same conditions.
- For \X, you can get sorta close by using \P{M}\p{M}*, but that incorrectly splits up CRLF constructs and allows marks on the same, all of which is really quite wrong.
- For \R, you can construct a work-around using (?:\r\n|[\n-\r\u0085\u2028\u2029]).

The conclusion is that JavaScript’s regexes are completely unsuited for Unicode work. However, the XRegExp plugin moves closer to making that feasible. If you can live with its restrictions, this is probably easier than switching to a different but Unicode-aware programming language. It’s certainly better than being unable to use Unicode regexes at all.
However, it is still a rather long way from meeting even the most basic requirements (Level 1 support) for Unicode regexes as spelled out in the standard. Someday you are going to want to match characters whether or not they have accent marks on them, or which sit in the Mathematical Alphanumeric Symbols block, or which use the Unicode case-mapping and case-folding definitions, or which follow The Unicode Standard for alphanumeric sorts or for line- and word-breaking, and you cannot do any of those things in JavaScript even with the plug-in.
So you might wish to consider using a language that is compliant with The Unicode Standard if you actually need to handle Unicode. Javascript just doesn’t manage that.
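For comparison, here is a hedged sketch of the \b work-around described above, applied to the string from the question. It rests on an assumption the answer does not make: that the code runs on an engine with ES2018 features, namely lookbehind and \p{…} classes under the `u` flag. The character class follows the \w definition quoted earlier, and the names `WORD_CHAR`, `escapeRegExp`, and `wrapWords` are invented for this example.

```javascript
// Assumption: an ES2018 engine (lookbehind, plus \p{…} under the `u` flag).
// Character class for a Unicode-aware \w, following the definition quoted above.
const WORD_CHAR = String.raw`[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}]`;

// Escape regex metacharacters so that list entries are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap every whole-word occurrence of `words` in {}.
function wrapWords(text, words) {
  // Longest entries first, so "two two" is preferred over "two" and "tw".
  const alternation = [...words]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|");
  // No word character may sit directly before or after the match.
  const re = new RegExp(
    `(?<!${WORD_CHAR})(?:${alternation})(?!${WORD_CHAR})`,
    "giu"
  );
  return text.replace(re, "{$&}");
}

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";
const words = ["one", "tw", "two two", "two", "twöu", "three", "föur"];
console.log(wrapWords(txt, words));
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```

Because both boundaries are lookarounds, nothing outside the word itself is consumed, so the replacement can simply be `{$&}`.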
Firstly, unless the regex is dynamic, please use the /.../gi literal notation.
It returns the wrong value because \W in JavaScript is really just [^0-9a-zA-Z_]. Accented characters like é are not considered word characters, so you need to exclude them manually.
var re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;
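Applied to the string from the question, the pattern can be checked as follows. This is a small sketch: the replacement string `"$1{$2}"` and the use of `console.log` are choices made for the example, and the character class only lists the three accented letters that occur in this sample, so other text would need a longer list.

```javascript
// \W is ASCII-only, so an accented letter counts as a "non-word" character.
console.log(/\W/.test("é")); // true

const txt = "¿One;one oneé two two two two two twö twöu three;;twä;föur?";

// Same pattern as above: no letter (including ä, é, ö) may touch the match.
const re = /(^|[^a-zäéö])(one|tw|two two|two|twöu|three|föur)(?=[^a-zäéö]|$)/gi;

// $1 puts back the separator that the first group consumed.
const result = txt.replace(re, "$1{$2}");
console.log(result);
// ¿{One};{one} oneé {two two} {two two} {two} twö {twöu} {three};;twä;{föur}?
```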