Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why this regex is not working for german words?

I am trying to break the following sentence in words and wrap them in span.

<p class="german_p big">Das ist ein schönes Armband</p>

I followed this: How to get a word under cursor using JavaScript?

$('p').each(function() {
            var $this = $(this);
            $this.html($this.text().replace(/\b(\w+)\b/g, "<span>$1</span>"));
        });

The only problem i am facing is, after wrapping the words in span the resultant html is like this:

<p class="german_p big"><span>Das</span> <span>ist</span> <span>ein</span> <span>sch</span>ö<span>nes</span> <span>Armband</span>.</p>

so, schönes is broken into three words sch, ö and nes. why this is happening? What could be the correct regex for this?

like image 405
Rakesh Juyal Avatar asked Oct 28 '10 13:10

Rakesh Juyal


People also ask

Why my regex is not working?

You regex doesn't work, because you're matching the beginning of the line, followed by one or more word-characters (doesn't matter if you use the non-capturing group (?:…) here or not), followed by any characters.

Does JavaScript support regex?

Using regular expressions in JavaScript. Regular expressions are used with the RegExp methods test() and exec() and with the String methods match() , replace() , search() , and split() . Executes a search for a match in a string. It returns an array of information or null on a mismatch.


2 Answers

Unicode in Javascript Regexen

Like Java itself, Javascript doesn't support Unicode in its \w, \d, and \b regex shortcuts. This is (arguably) a bug in Java and Javascript. Even if one manages through casuistry or obstinacy to argue that it is not a bug, it's sure a big gotcha. Kinda bites, really.

The problem is that those popular regex shortcuts only apply to 7-bit ASCII whether in Java or in Javascript. This restriction is painfully 1970s‐ish; it makes absolutely no sense in the 21ˢᵗ century. This blog posting from this past March makes a good argument for fixing this problem in Javascript.

It would be really nice if some public-spirited soul would please add Javascript to this Wikipedia page that compares the support regex features in various languages.

This page says that Javascript doesn't support any Unicode properties at all. That same site has a table that's a lot more detailed than the Wikipedia page I mention above. For Javascript features, look under its ECMA column.

However, that table is in some cases at least five years out of date, so I can't completely vouch for it. It's a good start, though.

Unicode Support in Other Languages

Ruby, Python, Perl, and PCRE all offer ways to extend \w to mean what it is supposed to mean, but the two J‐thingies do not.

In Java, however, there is a good workaround available. There, you can use \pL to mean any character that has the Unicode General_Category=Letter property. That means you can always emulate a proper \w using [\pL\p{Nd}_].

Indeed, there's even an advantage to writing it that way, because it keeps you aware that you're adding decimal numbers and the underscore character to the character class. With a simple \w, please sometimes forget this is going on.

I don't believe that this workaround is available in Javascript, though. You can also use Unicode properties like those in Perl and PCRE, and in Ruby 1.9, but not in Python.

The only Unicode properties current Java supports are the one- and two-character general properties like \pN and \p{Lu} and the block properties like \p{InAncientSymbols}, but not scripts like \p{IsGreek}, etc.

The future JDK7 will finally get around to adding scripts. Even then Java still won't support most of the Unicode properties, though, not even critical ones like \p{WhiteSpace} or handy ones like \p{Dash} and \p{Quotation_Mark}.

SIGH! To understand just how limited Java's property support is, merely compare it with Perl. Perl supports 1633 Unicode properties as of 2007's 5.10 release, and 2478 of them as of this year's 5.12 release. I haven't counted them for ancient releases, but Perl started supporting Unicode properties back during the last millennium.

Lame as Java is, it's still better than Javascript, because Javascript doesn't support any Unicode properties whatsoCENSOREDever. I'm afraid that Javascript's paltry 7-bit mindset makes it pretty close to unusable for Unicode. This is a tremendously huge gaping hole in the language that's extremely difficult to account for given its target domain.

Sorry 'bout that. ☹

like image 180
tchrist Avatar answered Sep 23 '22 23:09

tchrist


You can also use

/\b([äöüÄÖÜß\w]+)\b/g

instead of

/\b(\w+)\b/g

in order to handle the umlauts

like image 24
XViD Avatar answered Sep 21 '22 23:09

XViD