Use locale characters in regular expressions with javascript

Tags: javascript, regex, locale

I guess it's easier to explain with an example:

'gracias senor'.match(/\w+/g)
["gracias", "senor"]

But if I use any non-English character:

'gracias señor'.match(/\w+/g)
["gracias", "se", "or"]

Is there some way to take into account characters like ñ, á, é, etc.?

837

asked Feb 03 '14 05:02

opensas

2 Answers

According to Wikipedia, the Spanish alphabet consists of:

English alphabet: A-Z, a-z
N with tilde: ñ and Ñ
Vowels with an acute accent (á, é, í, ó, ú) or a diaeresis (ü), and their corresponding uppercase forms

There are two ways to encode characters with diacritical marks:

Single glyph (precomposed): á ("\u00E1")
With combining mark: á ("a\u0301", the base letter followed by U+0301 COMBINING ACUTE ACCENT)

You will need to handle both forms. Thankfully, Spanish letters carry at most one diacritical mark.

Unicode also contains other characters that decompose to the English letters A-Z or a-z. Since JavaScript's RegExp has poor support for Unicode and those characters are rarely used anyway, I ignore those cases.

Therefore, to match a single Spanish letter in either form (precomposed or with a combining mark):

[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ]

(Uppercase and lowercase letters are listed explicitly, so the pattern does not depend on the i flag.)

Back to the problem of matching a word. This depends on your definition of a "word character".

Let's say a (Spanish) "word" consists of letters of the Spanish alphabet and the digits 0-9:

(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+

Test code (the second "señor" is written as n followed by the combining tilde \u0303):

'gracias señor sen\u0303or'.match(/(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+/g).forEach(function(v){console.log(v + " " + v.length)});

Output (matched word and length):

gracias 7
señor 5
señor 6
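
As an alternative to listing the combining-mark forms in the pattern, the input could be normalized first so that only precomposed letters need to be matched. This is a minimal sketch, assuming an engine that provides String.prototype.normalize (ES2015 and later); the helper name matchSpanishWords is hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: normalize to NFC, then match precomposed letters only.
function matchSpanishWords(text) {
  // NFC composes a base letter plus combining mark (e.g. "n" + U+0303)
  // into the single precomposed code point ("ñ", U+00F1) when one exists.
  var normalized = text.normalize('NFC');
  // match() returns null when nothing matches, so fall back to an empty array.
  return normalized.match(/[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9]+/g) || [];
}

// "se\u00F1or" is the precomposed form, "sen\u0303or" the combining form.
var words = matchSpanishWords('gracias se\u00F1or sen\u0303or');
words.forEach(function (v) { console.log(v + ' ' + v.length); });
// gracias 7
// señor 5
// señor 5
```

Note that the matched strings come from the normalized copy, so their lengths and offsets may differ from those in the original input.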

152

answered Sep 25 '22 03:09

nhahtdh

You can use Unicode ranges.

'gracias señor'.match(/[\u0080-\u00FF\w]+/g)

Here's a great reference of the Unicode ranges and their escaped values.
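
One caveat with the range above: \u0080-\u00FF also covers non-letter code points such as the multiplication sign (U+00D7) and the division sign (U+00F7). The sketch below compares it with a narrower range that keeps only the Latin-1 letters; the variable names broad and lettersOnly are just illustrative.

```javascript
// The range from the answer: every code point from U+0080 to U+00FF, plus \w.
var broad = /[\u0080-\u00FF\w]+/g;

// Narrower variant: U+00C0..U+00FF without U+00D7 (×) and U+00F7 (÷), plus \w.
var lettersOnly = /[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\w]+/g;

// "se\u00F1or" is "señor" with a precomposed ñ.
console.log('gracias se\u00F1or'.match(lettersOnly)); // ["gracias", "señor"]

// "a\u00D7b" is "a×b": the broad range treats × as a word character.
console.log('a\u00D7b'.match(broad));       // ["a×b"]
console.log('a\u00D7b'.match(lettersOnly)); // ["a", "b"]
```

Neither pattern handles letters written with combining marks, and \w still admits digits and the underscore.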

EDIT

So I came back to reference this and curiosity got the best of me. How can I use a range of characters and be sure that only letters are used?

Below is a code snippet that uses unicode ranges to return the letters only. Using the range 0x0000 - 0x00FF returns the following characters: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

Not sure of its accuracy but it was a fun learning experiment.

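function probablyIsLetter(char) {

  // The original version looped over 'a'..'z' (codes 97-122) but overwrote
  // its result on every pass, so only the comparison against 'z' mattered.
  // Its options object was also passed in the `locales` position, where it
  // had no effect. This keeps the same effective behavior: accept characters
  // that do not sort after 'z' in the default collation.
  return char.toLowerCase().localeCompare('z') <= 0;

}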


function getFilteredUnicodeRange(start, end) {

  var buffer = [];

  start = start || 0x0000;
  end = end || 0x09FF;

  for (var i = start; i <= end; i += 1) {
    var char = String.fromCharCode(i);
    if (char.toUpperCase() !== char.toLowerCase() && probablyIsLetter(char)) {
      buffer.push(char);
    }
  }

  return buffer.join('');

}

var characters = getFilteredUnicodeRange(0x0000, 0x00FF);
var regex = new RegExp('[' + characters + ']+', 'g');

var elementOutput = document.getElementById('example-output');
elementOutput.innerText = 'gracias señor'.match(regex);

var elementRegex = document.getElementById('example-characters');
elementRegex.innerText = characters;

<pre id="example-characters"></pre>

<pre id="example-output"></pre>
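
For comparison, engines that support Unicode property escapes (ES2018 and later) can classify letters directly instead of inferring them from case mapping and collation. This is a different technique from the snippet above, shown only as a sketch; the helper name getLettersInRange is hypothetical.

```javascript
// Hypothetical helper: collect the characters in [start, end] that the
// engine classifies as letters (Unicode general category L).
function getLettersInRange(start, end) {
  var buffer = [];
  for (var i = start; i <= end; i += 1) {
    var char = String.fromCharCode(i);
    // \p{L} requires the "u" flag.
    if (/^\p{L}$/u.test(char)) {
      buffer.push(char);
    }
  }
  return buffer.join('');
}

var letters = getLettersInRange(0x0000, 0x00FF);
// Only letters are collected, so nothing needs escaping inside the class.
var letterRegex = new RegExp('[' + letters + ']+', 'g');

// "se\u00F1or" is "señor" with a precomposed ñ.
console.log('gracias se\u00F1or'.match(letterRegex)); // ["gracias", "señor"]
```

When the character list itself is not needed, /\p{L}+/gu matches runs of letters directly; adding \p{M} to the class would also cover combining marks.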

answered Sep 22 '22 03:09

slamborne
