What is the easiest way to match non-ASCII characters in a regex? I would like to match all words individually in an input string, but the language may not be English, so I will need to match things like ü, ö, ß, and ñ. Also, this is in Javascript/jQuery, so any solution will need to apply to that.
Alternatively, you can also use regular expressions to find non-ASCII characters. ASCII character set is captured using regex [A-Za-z0-9]. You can use this regex in your query as shown below, to find non-ASCII characters. mysql> SELECT * FROM data WHERE full_name NOT REGEXP '[A-Za-z0-9]';
To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).
This will make your regular expressions work with all Unicode regex engines. In addition to the standard notation, \p{L}, Java, Perl, PCRE, the JGsoft engine, and XRegExp 3 allow you to use the shorthand \pL. The shorthand only works with single-letter Unicode properties.
This should do it:
[^\x00-\x7F]+
It matches any character which is not contained in the ASCII character set (0-127, i.e. 0x0 to 0x7F).
You can do the same thing with Unicode:
[^\u0000-\u007F]+
For unicode you can look at this 2 resources:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With