I guess it's easier to explain with an example:
'gracias senor'.match(/\w+/g)
["gracias", "senor"]
But if I use any non-English character:
'gracias señor'.match(/\w+/g)
["gracias", "se", "or"]
Is there some way to take into account characters like ñ, á, é, etc.?
^ means "Match the start of the string" and $ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and, used together, ensure that the entire string is matched instead of just a substring.
As mentioned in other answers, JavaScript regexes historically had no support for Unicode character classes (engines older than ES2018 do not understand \p{...} property escapes).
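For what it's worth, engines that implement ES2018 or later accept Unicode property escapes when the `u` flag is set, so the limitation applies mainly to older runtimes. A minimal sketch, assuming such an engine; the names `wordPattern`, `precomposedMatches` and `decomposedMatches` are only illustrative and the choice of `\p{L}\p{M}\p{N}_` as "word characters" is an assumption:

```javascript
// Requires an ES2018+ engine: \p{...} is only recognized with the `u` flag.
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
const wordPattern = /[\p{L}\p{M}\p{N}_]+/gu;

// "ñ" written as the single code point U+00F1.
const precomposedMatches = 'gracias se\u00F1or'.match(wordPattern);
// "ñ" written as "n" + U+0303 COMBINING TILDE; \p{M} keeps it inside the word.
const decomposedMatches = 'gracias sen\u0303or'.match(wordPattern);

console.log(precomposedMatches.map(w => w.length)); // [7, 5]
console.log(decomposedMatches.map(w => w.length));  // [7, 6]
```

On runtimes without property escapes the pattern is a syntax error, so the hand-written character classes discussed in this thread are still the fallback.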
[] denotes a character class. () denotes a capturing group. [a-z0-9] matches one character that is in the range a-z OR 0-9. (a-z0-9) captures the literal text a-z0-9.
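The difference is easy to check in a console. A small sketch; the sample strings and variable names are made up for illustration:

```javascript
// [a-z0-9] is a character class: it matches ONE character from the set.
const classMatch = 'x9'.match(/[a-z0-9]/);
console.log(classMatch[0]); // "x"

// (a-z0-9) is a capturing group around the literal text "a-z0-9".
const groupMissesX9 = /(a-z0-9)/.test('x9');
console.log(groupMissesX9); // false

const groupMatch = 'id: a-z0-9'.match(/(a-z0-9)/);
console.log(groupMatch[1]); // "a-z0-9"
```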
According to Wikipedia, the Spanish alphabet consists of:
A-Z, a-z
ñ and Ñ
á, é, í, ó, ú, ü (and their corresponding uppercase characters)
Since there are 2 ways to specify characters with diacritical marks:
á (a single precomposed code point, "\u00E1")
á (a base letter followed by a combining mark, "a\u0301")
you will need to at least take care of both cases. Thankfully, Spanish only has at most 1 diacritical mark on the characters.
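The two spellings can be compared directly. A minimal sketch, assuming a runtime with ES2015 `String.prototype.normalize`; the variable names are illustrative, and normalizing the input first is an alternative that the answer itself does not use:

```javascript
const precomposed = '\u00E1'; // á as one code point
const decomposed = 'a\u0301'; // "a" + U+0301 COMBINING ACUTE ACCENT

// They render the same but are different strings.
console.log(precomposed === decomposed);            // false
console.log(precomposed.length, decomposed.length); // 1 2

// Normalizing to NFC (composed) or NFD (decomposed) makes them comparable.
console.log(decomposed.normalize('NFC') === precomposed); // true
console.log(precomposed.normalize('NFD') === decomposed); // true
```

If you can normalize every input to NFC before matching, the precomposed letters alone may be enough; the pattern in this answer handles both spellings without that step.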
In Unicode, there are also characters that decompose to the English letters A-Z or a-z. Since JavaScript's RegExp has poor support for Unicode and they are rarely used anyway, I ignore those cases.
Therefore, to correctly match a letter of the Spanish alphabet (single glyph or base letter plus combining mark):
[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ]
(Note that both cases are listed explicitly, so the pattern does not rely on how the i flag treats non-US-ASCII characters.)
Back to the problem of matching a word. This depends on your definition of a "word character".
Let's say a (Spanish) "word" consists of letters of the Spanish alphabet and the digits 0-9:
(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+
Test code:
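// The second "señor" is spelled with "n" + U+0303 (combining tilde), which is why its length is 6.
'gracias señor sen\u0303or'.match(/(?:[aeiouAEIOU]\u0301|[uU]\u0308|[nN]\u0303|[a-zA-ZáéíóúüÁÉÍÓÚÜñÑ0-9])+/g).forEach(function (v) { console.log(v + " " + v.length); });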
Output (matched word and length):
gracias 7
señor 5
señor 6
You can use Unicode ranges.
'gracias señor'.match(/[\u0080-\u00FF\w]+/g)
Here's a great reference of the Unicode ranges and their escaped values.
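One caveat worth checking: the Latin-1 Supplement block covered by this range also holds punctuation and symbols, not just letters. A quick sketch; the helper name `latin1Words` is made up for illustration:

```javascript
// Hypothetical helper wrapping the range-based pattern from this answer.
function latin1Words(text) {
  return text.match(/[\u0080-\u00FF\w]+/g);
}

console.log(latin1Words('gracias se\u00F1or')); // ["gracias", "señor"]

// U+00BF (¿) and U+00D7 (×) fall inside \u0080-\u00FF,
// so they are treated as word characters too.
console.log(latin1Words('\u00BFse\u00F1or?')); // ["¿señor"]
console.log(latin1Words('2\u00D73'));          // ["2×3"]
```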
EDIT
So I came back to reference this and curiosity got the best of me. How can I use a range of characters and be sure that only letters are used?
Below is a code snippet that uses Unicode ranges to return the letters only. Using the range 0x0000 - 0x00FF returns the following characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ
Not sure of its accuracy but it was a fun learning experiment.
function probablyIsLetter(char) {
  // Heuristic: treat the character as a letter if it does not sort after "z"
  // in the runtime's default locale, so the result is locale-dependent.
  // (localeCompare takes its options as the THIRD argument, after locales.)
  return char.toLowerCase().localeCompare('z') <= 0;
}
function getFilteredUnicodeRange(start, end) {
  var buffer = [];
  start = start || 0x0000;
  end = end || 0x09FF;
  for (var i = start; i <= end; i += 1) {
    var char = String.fromCharCode(i);
    // Keep only cased characters (upper and lower forms differ) that also pass the letter heuristic.
    if (char.toUpperCase() !== char.toLowerCase() && probablyIsLetter(char)) {
      buffer.push(char);
    }
  }
  return buffer.join('');
}
// Build a character class from the filtered range and show the results on the page.
var characters = getFilteredUnicodeRange(0x0000, 0x00FF);
var regex = new RegExp('[' + characters + ']+', 'g');
var elementOutput = document.getElementById('example-output');
elementOutput.innerText = 'gracias señor'.match(regex);
var elementRegex = document.getElementById('example-characters');
elementRegex.innerText = characters;
<pre id="example-characters"></pre>
<pre id="example-output"></pre>