I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that
A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". (Source: AS3 RegExp to match words with boundry type characters in them)
And since
\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]). (Source: http://www.javascriptkit.com/javatutors/redev2.shtml)
obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is treated as a non-word character, there is a word boundary on either side of it, and then al is a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a character, I don't exactly know how to go about finding it.
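A quick check makes the behaviour concrete. This is only a sketch: the sample string and the variable names are made up for illustration, and the é is written as the escape \u00e9 so the result does not depend on file encoding.

```javascript
// \w is ASCII-only, so "é" (U+00E9) counts as a non-word character (\W).
const isWordChar = /^\w$/;
const accentedIsWord = isWordChar.test("\u00e9");
const plainIsWord = isWordChar.test("e");

// A boundary therefore exists on each side of the "é",
// which makes "al" look like a standalone two-letter word.
const sampleWord = "Montr\u00e9al";
const found = /\b[a-z]{2}\b/i.exec(sampleWord);

console.log(accentedIsWord, plainIsWord); // false true
console.log(found[0], found.index);       // al 6
```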
Any help?
Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:
var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they are hopelessly inadequate when it comes to \w and \b, which only consider the ASCII word characters [a-zA-Z0-9_]. If you want them to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.
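As an aside that goes beyond the advice above: engines that implement ES2018 accept Unicode property escapes (\p{...}) under the u flag, along with lookbehind assertions, so a boundary can be emulated by hand. The following is a rough sketch under that assumption; WORD_CHAR, re_state_unicode and findTwoLetterWord are hypothetical names, and older browsers will reject the pattern outright.

```javascript
// Hypothetical definition of "part of a word":
// \p{L} = letters, \p{M} = combining marks, \p{N} = digits, plus underscore.
const WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

// Two letters that are neither preceded nor followed by a word character.
// The lookarounds play the role that \b plays for ASCII text.
const re_state_unicode = new RegExp(
  "(?<!" + WORD_CHAR + ")(\\p{L}{2})(?!" + WORD_CHAR + ")",
  "u"
);

// Returns the first two-letter word, or "" when there is none.
function findTwoLetterWord(text) {
  const m = re_state_unicode.exec(text);
  return m ? m[1] : "";
}

console.log(findTwoLetterWord("Montr\u00e9al") === ""); // true: "al" is no longer picked up
console.log(findTwoLetterWord("Montr\u00e9al, QC"));    // QC
```

Note that \p{L}{2} accepts any two letters, accented ones included; swap it for [a-zA-Z]{2} if only ASCII state or province codes should match.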
By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:
"\\b([a-z]{2})\\b,?"
I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:
"\\b[a-z]{2}\\b"