
Use locale characters in regular expressions with JavaScript

I guess it's easier to explain with an example:

'gracias senor'.match(/\w+/g)
["gracias", "senor"]

But if I use any non-English character:

'gracias señor'.match(/\w+/g)
["gracias", "se", "or"]

Is there some way to take into account characters like ñ, á, é, etc.?

opensas asked Feb 03 '14 05:02




2 Answers

According to Wikipedia, the Spanish alphabet consists of:

  • English alphabet: A-Z, a-z
  • N with diacritic tilde: ñ and Ñ
  • Accented characters: á, é, í, ó, ú, ü (and their corresponding uppercase characters)

There are two ways to represent characters with diacritical marks:

  • Single glyph: á
  • With combining mark: "a\u0301" (a followed by U+0301 COMBINING ACUTE ACCENT)

You need to handle at least both of these forms. Thankfully, Spanish letters carry at most one diacritical mark each.
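The difference between the two forms can be sketched as follows. This assumes an engine that provides String.prototype.normalize (ES2015 or later), which is newer than this answer; the variable names are only illustrative.

```javascript
// Two encodings of the same visible letter "á".
const precomposed = '\u00E1'; // single code point (single glyph form)
const decomposed = 'a\u0301'; // "a" followed by COMBINING ACUTE ACCENT

// They render alike but are different strings with different lengths.
const sameBeforeNormalizing = precomposed === decomposed; // false
const lengths = [precomposed.length, decomposed.length];  // [1, 2]

// NFC composes and NFD decomposes, so normalizing both sides to the
// same form makes the two spellings compare equal.
const sameAfterNfc =
  precomposed.normalize('NFC') === decomposed.normalize('NFC'); // true
const nfdOfPrecomposed = precomposed.normalize('NFD');          // "a\u0301"

console.log(sameBeforeNormalizing, lengths, sameAfterNfc,
  nfdOfPrecomposed === decomposed);
```

Normalizing input to NFC before matching is one way to avoid handling the combining-mark form in the pattern itself, at the cost of requiring a newer engine.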

In Unicode, there are also characters that decompose to the English letters A-Z or a-z. Since JavaScript's RegExp has poor support for Unicode and those characters are rarely used anyway, I ignore those cases.

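Therefore, to match one letter of the Spanish alphabet (in either single-glyph or combining-mark form):

[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ]

(Both cases are listed explicitly, so the pattern works without the i flag. In JavaScript the i flag does also fold case for non-ASCII letters, so /ñ/i matches Ñ.)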


Back to the problem of matching a word. This depends on your definition of a "word character".

Let's say a Spanish "word" consists of letters of the Spanish alphabet and the digits 0-9:

(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+

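Test code (the second "señor" is written with an escape so that it uses n followed by the combining tilde U+0303):

var re = /(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+/g;
'gracias señor sen\u0303or'.match(re).forEach(function (v) {
  console.log(v + " " + v.length);
});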

Output (matched word and length):

gracias 7
señor 5
señor 6
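As an alternative sketch, engines that support ES2018 Unicode property escapes can express "letter plus optional combining marks" without listing characters by hand. This is an assumption about the runtime, not part of the approach above, and the class is broader: \p{L} accepts letters from every script, not only Spanish. The names wordPattern, text and words are illustrative.

```javascript
// Requires the `u` flag; \p{L} = any letter, \p{M} = any combining mark.
// Digits are kept as 0-9 to mirror the "word" definition used above.
const wordPattern = /[\p{L}\p{M}0-9]+/gu;

// First "señor" uses precomposed ñ (U+00F1); the second uses n + U+0303.
const text = 'gracias se\u00F1or sen\u0303or';
const words = text.match(wordPattern);

words.forEach((w) => console.log(w + ' ' + w.length));
// gracias 7
// señor 5
// señor 6
```

The hand-written class remains the safer choice when only Spanish letters should be accepted or when older engines must be supported.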
nhahtdh answered Sep 25 '22 03:09



You can use Unicode ranges.

'gracias señor'.match(/[\u0080-\u00FF\w]+/g)

Here's a great reference of the Unicode ranges and their escaped values.
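One caveat with the range above, shown as a small sketch: U+0080-U+00FF also contains symbols that are not letters. The narrower class latin1Letters is only an illustration of skipping two of them, and it still does not cover letters written with combining marks.

```javascript
const rangePattern = /[\u0080-\u00FF\w]+/g;
const words = 'gracias se\u00F1or'.match(rangePattern); // ["gracias", "señor"]

// The same range also accepts Latin-1 symbols that are not letters,
// for example × (U+00D7), ÷ (U+00F7) and © (U+00A9).
const symbols = '\u00D7\u00F7\u00A9'.match(rangePattern); // one 3-character match

// Narrower sketch: only U+00C0-U+00FF, leaving out × and ÷.
const latin1Letters = /[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\w]+/g;
const wordsNarrow = 'gracias se\u00F1or'.match(latin1Letters); // ["gracias", "señor"]
const symbolsNarrow = '\u00D7\u00F7\u00A9'.match(latin1Letters); // null

console.log(words, symbols, wordsNarrow, symbolsNarrow);
```

Note that \w still admits digits and the underscore in both patterns.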

EDIT

So I came back to reference this and curiosity got the best of me. How can I use a range of characters and be sure that only letters are used?

Below is a code snippet that uses Unicode ranges to return only the letters. Using the range 0x0000 - 0x00FF returns the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

Not sure of its accuracy but it was a fun learning experiment.

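function probablyIsLetter(char) {

  // Heuristic: accept the character if it collates no later than "z" in the
  // default locale. The original loop over a-z (char codes 97-122) overwrote
  // its result on every pass, so only this last comparison ever counted, and
  // its options object was passed as the locales argument, so it was never
  // applied as options.
  var result = char.toLowerCase().localeCompare('z');

  return result <= 0;

}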


function getFilteredUnicodeRange(start, end) {

  var buffer = [];

  start = start || 0x0000;
  end = end || 0x09FF;

  for (var i = start; i <= end; i += 1) {
    var char = String.fromCharCode(i);
    if (char.toUpperCase() !== char.toLowerCase() && probablyIsLetter(char)) {
      buffer.push(char);
    }
  }

  return buffer.join('');

}

var characters = getFilteredUnicodeRange(0x0000, 0x00FF);
var regex = new RegExp('[' + characters + ']+', 'g');

var elementOutput = document.getElementById('example-output');
elementOutput.innerText = 'gracias señor'.match(regex);

var elementRegex = document.getElementById('example-characters');
elementRegex.innerText = characters;
<pre id="example-characters"></pre>

<pre id="example-output"></pre>
slamborne answered Sep 22 '22 03:09
