
JavaScript validation issue with international characters


We use the excellent validator plugin for jQuery here on Stack Overflow to do client-side validation of input before it is submitted to the server.

It generally works well, but this one has us scratching our heads.

The following validator method is used on the ask/answer form for the user name field (note that you must be logged out to see this field on the live site; it's on every /question page and the /ask page):

$.validator.addMethod("validUserName",
  function(value, element) {
    return this.optional(element) ||
      /^[\w\-\s\dÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/.test(value);
  },
  "Can only contain A-Z, 0-9, spaces, and hyphens.");

Now this regex looks weird but it's pretty simple:

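  • match the beginning of the string (^)
  • match one or more ([...]+) of any of these..
    • word character (\w), which already covers A-Z, a-z, 0-9, and underscore
    • dash (-)
    • space (\s)
    • digit (\d), redundant with \w but harmless
    • crazy moon language characters (àèìòù etc)
  • now match the end of the string ($)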

Yes, we ran into the Internationalized Regular Expressions problem. JavaScript's definition of "word character" does not include international characters.. at all.
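To make the "word character" limitation concrete, here is a minimal sketch. The variable names are made up for illustration, and the second pattern assumes an engine with ES2018 Unicode property escapes (the u flag with \p{L} and \p{N}), which is a different technique from the one used in the validator above.

```javascript
// Without the u flag, \w is ASCII-only: it is equivalent to [A-Za-z0-9_].
const asciiWord = /^\w+$/;
const plainMatches = asciiWord.test('Bill');          // ASCII letters only
const accentedMatches = asciiWord.test('h\u00D3ra');  // contains Ó (U+00D3)

// Assumption: the engine supports ES2018 Unicode property escapes.
// \p{L} matches any Unicode letter, \p{N} any Unicode number.
const unicodeName = /^[\p{L}\p{N}\-\s]+$/u;
const unicodeMatches = unicodeName.test('\u00D3Bill de h\u00D3ra');

console.log(plainMatches, accentedMatches, unicodeMatches);
// prints: true false true
```

The test strings use \u escapes for Ó so the snippet behaves the same regardless of how the file is encoded.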

Here's the weird part: even though we've gone to the trouble of manually adding tons of the valid international characters to the regex, it doesn't work. You cannot enter these international characters in the input box for user name without getting the..

Can only contain A-Z, 0-9, spaces, and hyphens

.. validation return!

Obviously the validation is working for the other parts of the regex.. so.. what gives?

The other strange part is that this validation works in the browser's JavaScript console but not when executed as a part of our standard *.js includes.

/^[\w-\sÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöüçÇßØøÅåÆæÞþÐð]+$/ .test('ÓBill de hÓra') === true
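One plausible explanation for "works in the console, fails in the include" is an encoding mismatch: the .js file is saved as UTF-8 but decoded by the browser as a single-byte charset. This is an assumption, not something confirmed in the question. The sketch below simulates it with a hypothetical helper, misdecodeUtf8AsLatin1, and a shortened character class.

```javascript
// Hypothetical helper: simulate a UTF-8 file being decoded as ISO-8859-1,
// where every byte becomes one character with the same code point.
function misdecodeUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes of the source text
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// Shortened stand-in for the long list of accented characters: Ó and É.
const intendedClass = '\u00D3\u00C9';
const garbledClass = misdecodeUtf8AsLatin1(intendedClass);

const intended = new RegExp('^[\\w\\-\\s' + intendedClass + ']+$');
const garbled = new RegExp('^[\\w\\-\\s' + garbledClass + ']+$');

const input = '\u00D3Bill de h\u00D3ra'; // what the user actually types
const intendedResult = intended.test(input);
const garbledResult = garbled.test(input);

console.log(intendedResult, garbledResult);
// prints: true false
```

In the garbled case each accented character has turned into two unrelated characters inside the class, so the real Ó typed by the user is no longer in the allowed set.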

We've run into some really bizarre international character issues in JavaScript code before, resulting in some very, very nasty hacks. We'd like to understand what's going on here and why. Please enlighten us!

Jeff Atwood asked Jul 02 '09 09:07



1 Answer

I think the email and url validation methods are a good reference here, e.g. the email method:

email: function(value, element) {     return this.optional(element) || /^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i.test(value); }, 

The script to compile that regex.

In other words, replacing your arbitrary list of "crazy moon" characters with this could help:

[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] 

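Basically, this avoids the character encoding issues you have elsewhere: \uXXXX escapes are plain ASCII, so the pattern means the same thing no matter which encoding the .js file is saved or served in. Note that these ranges are more general definitions and match far more than your list, so this is a broader rule rather than an exact replacement. While not necessarily more readable, it's shorter than your full list.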
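As a rough sketch of how that substitution might look in the validUserName method from the question: the names USER_NAME_PATTERN and isValidUserName are made up for illustration, and the jQuery hook-up is guarded so the snippet also runs where jQuery and the validation plugin are not loaded.

```javascript
// ASCII-only source: the non-ASCII characters are expressed as \u escape
// ranges, so the file's encoding cannot change what the pattern means.
const USER_NAME_PATTERN = /^[\w\-\s\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]+$/;

// Hypothetical helper wrapping the pattern so it can be tested on its own.
function isValidUserName(value) {
  return USER_NAME_PATTERN.test(value);
}

// Register with the validation plugin only when jQuery and the plugin exist.
if (typeof $ !== 'undefined' && $.validator) {
  $.validator.addMethod('validUserName', function (value, element) {
    return this.optional(element) || isValidUserName(value);
  }, 'Can only contain A-Z, 0-9, spaces, and hyphens.');
}

console.log(isValidUserName('\u00D3Bill de h\u00D3ra')); // name containing Ó
console.log(isValidUserName('bad!name'));                // "!" is not allowed
// prints: true, then false
```

Because the ranges also cover many non-letter symbols above U+00A0, the error message is no longer an exact description of what is accepted; tightening the ranges or rewording the message is a separate decision.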

Jörn Zaefferer answered Jan 03 '23 10:01