
Latin Characters check

Tags:

javascript

regex

unicode

character-properties

There are some similar questions out there, but none that are quite the same or that have an answer that works for me.

I need a JavaScript function which validates whether a text field contains only valid Latin characters, so no Cyrillic or Chinese, just Latin; specifically:

Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U+007E, U+00A0 to U+024F and U+1E00 to U+1EFF

Some of the answers out there seem to check the first character in the text field but miss out others, so these are no good.

This is what I have tried so far (this doesn't work!):


var value = 'abcdef' // from text field
var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// var re = '\\w+/'; // alternative
if (new RegExp(re).test(value)) {
    result = false;
}

The following sort of works but only for the first character:


//var re = '\u0000-\u007F|\u0100-\u017F|\u0180-\u024F|\u1E00-\u1EFF|\u0080-\u00FF'; // latin regexp string
// couldn't get the above to work so using the following:
var re = '\\w+';
if (!value.match(re)) {
    message = 'Please enter valid latin characters only';
    $focusField = $this;
}

What is the right way to do this?

I really need code, rather than an explanation, but both would be better.

Thanks

631

asked Apr 03 '13 10:04

CompanyDroneFromSector7G

2 Answers

EDIT: Note that the solution given in the accepted answer is incorrect. It is full of false positives and false negatives. The exact numeric code point numbers needed are given at the bottom of this post.

The example given in the question mistakenly attempts to use Block rather than Script properties!

You do not want to use Unicode block character properties here; you want to use Unicode script character properties. In other words, you really want Script=Latin and not to try to use Block=Basic_Latin plus Block=Latin_1 plus Block=Latin_1_Supplement plus Block=Latin_Extended_A plus Block=Latin_Extended_Additional.

Note also that the question neglected two other Latin blocks: Block=Latin_Extended_C and Block=Latin_Extended_D.

Even if you used the correct blocks, you would get 145 false positives that were in those blocks but which were not Latin script characters:


$ unichars '\P{Script=Latin}' '[\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
145

Furthermore, you would miss 403 false negatives that are indeed Latin script characters but which are not in those blocks:


$ unichars '\p{Script=Latin}' '[^\p{Block=Basic_Latin}\p{Block=Latin_1}\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}]' | wc -l
403
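
To make the block/script mismatch concrete, here is a small sketch with one character of each kind. It assumes a JavaScript engine with ES2018 Unicode property escapes (`\p{Script=...}` with the `u` flag), which postdate this answer; `isLatinScript` is a name made up for illustration.

```javascript
// Assumes an ES2018+ engine; \p{...} is only recognised with the `u` flag.
const isLatinScript = (ch) => /^\p{Script=Latin}$/u.test(ch);

// U+00D7 MULTIPLICATION SIGN sits in the Latin-1 Supplement block,
// yet its script is Common, not Latin: a block-based false positive.
const timesSign = '\u00D7';

// U+FF21 FULLWIDTH LATIN CAPITAL LETTER A has Script=Latin but lies in the
// Halfwidth and Fullwidth Forms block: a block-based false negative.
const fullwidthA = '\uFF21';

console.log(isLatinScript(timesSign));  // false
console.log(isLatinScript(fullwidthA)); // true
```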

You virtually never want to use Blocks; you want to use Scripts. That’s why Level 1 conformance of UTS#18 requires in Requirement 1.2 that the Script character property be supported, but says nothing of the Block property until Requirement 2.7: Full Properties.

See UTS#18 Annex A, Character Blocks, for more pitfalls that come of using Blocks instead of Scripts.

Removing the code points that lie outside the Basic Multilingual Plane due to the Javascript bug that makes it impossible to specify these by ranges, we are left with this set of insanely unmaintainable garbledy-gook needed to fish out all Unicode v6.2 code points having the Latin, Common, or Inherited script character property:


[\u0000-\u0040][\u0041-\u005A][\u005B-\u0060][\u0061-\u007A][\u007B-\u00A9]\u00AA[\u00AB-\u00B9]\u00BA[\u00BB-\u00BF][\u00C0-\u00D6]\u00D7[\u00D8-\u00F6]
\u00F7[\u00F8-\u02B8][\u02B9-\u02DF][\u02E0-\u02E4][\u02E5-\u02E9][\u02EC-\u02FF][\u0300-\u036F]\u0374\u037E\u0385\u0387[\u0485-\u0486]\u0589\u060C
\u061B\u061F\u0640[\u064B-\u0655][\u0660-\u0669]\u0670\u06DD[\u0951-\u0952][\u0964-\u0965]\u0E3F[\u0FD5-\u0FD8]\u10FB[\u16EB-\u16ED][\u1735-\u1736]
[\u1802-\u1803]\u1805[\u1CD0-\u1CD2]\u1CD3[\u1CD4-\u1CE0]\u1CE1[\u1CE2-\u1CE8][\u1CE9-\u1CEC]\u1CED[\u1CEE-\u1CF3]\u1CF4[\u1CF5-\u1CF6][\u1D00-\u1D25]
[\u1D2C-\u1D5C][\u1D62-\u1D65][\u1D6B-\u1D77][\u1D79-\u1DBE][\u1DC0-\u1DE6][\u1DFC-\u1DFF][\u1E00-\u1EFF][\u2000-\u200B][\u200C-\u200D][\u200E-\u2064]
[\u206A-\u2070]\u2071[\u2074-\u207E]\u207F[\u2080-\u208E][\u2090-\u209C][\u20A0-\u20BA][\u20D0-\u20F0][\u2100-\u2125][\u2127-\u2129][\u212A-\u212B]
[\u212C-\u2131]\u2132[\u2133-\u214D]\u214E[\u214F-\u215F][\u2160-\u2188]\u2189[\u2190-\u23F3][\u2400-\u2426][\u2440-\u244A][\u2460-\u26FF][\u2701-\u27FF]
[\u2900-\u2B4C][\u2B50-\u2B59][\u2C60-\u2C7F][\u2E00-\u2E3B][\u2FF0-\u2FFB][\u3000-\u3004]\u3006[\u3008-\u3020][\u302A-\u302D][\u3030-\u3037]
[\u303C-\u303F][\u3099-\u309A][\u309B-\u309C]\u30A0[\u30FB-\u30FC][\u3190-\u319F][\u31C0-\u31E3][\u3220-\u325F][\u327F-\u32CF][\u3358-\u33FF][\u4DC0-\u4DFF]
[\uA700-\uA721][\uA722-\uA787][\uA788-\uA78A][\uA78B-\uA78E][\uA790-\uA793][\uA7A0-\uA7AA][\uA7F8-\uA7FF][\uA830-\uA839][\uFB00-\uFB06][\uFD3E-\uFD3F]\uFDFD
[\uFE00-\uFE0F][\uFE10-\uFE19][\uFE20-\uFE26][\uFE30-\uFE52][\uFE54-\uFE66][\uFE68-\uFE6B]\uFEFF[\uFF01-\uFF20][\uFF21-\uFF3A][\uFF3B-\uFF40]
[\uFF41-\uFF5A][\uFF5B-\uFF65]\uFF70[\uFF9E-\uFF9F][\uFFE0-\uFFE6][\uFFE8-\uFFEE][\uFFF9-\uFFFD]

Personally, I would fire anyone who attempted to use that sort of nonsense.

Furthermore, the 3,225 code points that you miss because of the Javascript bug in handling full Unicode are the following:


10100-10102 10107-10133 10137-1013F 10190-1019B 101D0-101FC 101FD
1D000-1D0F5 1D100-1D126 1D129-1D166 1D167-1D169 1D16A-1D17A 1D17B-1D182
1D183-1D184 1D185-1D18B 1D18C-1D1A9 1D1AA-1D1AD 1D1AE-1D1DD 1D300-1D356
1D360-1D371 1D400-1D454 1D456-1D49C 1D49E-1D49F 1D4A2 1D4A5-1D4A6
1D4A9-1D4AC 1D4AE-1D4B9 1D4BB 1D4BD-1D4C3 1D4C5-1D505 1D507-1D50A
1D50D-1D514 1D516-1D51C 1D51E-1D539 1D53B-1D53E 1D540-1D544 1D546
1D54A-1D550 1D552-1D6A5 1D6A8-1D7CB 1D7CE-1D7FF 1F000-1F02B 1F030-1F093
1F0A0-1F0AE 1F0B1-1F0BE 1F0C1-1F0CF 1F0D1-1F0DF 1F100-1F10A 1F110-1F12E
1F130-1F16B 1F170-1F19A 1F1E6-1F1FF 1F201-1F202 1F210-1F23A 1F240-1F248
1F250-1F251 1F300-1F320 1F330-1F335 1F337-1F37C 1F380-1F393 1F3A0-1F3C4
1F3C6-1F3CA 1F3E0-1F3F0 1F400-1F43E 1F440 1F442-1F4F7 1F4F9-1F4FC
1F500-1F53D 1F540-1F543 1F550-1F567 1F5FB-1F640 1F645-1F64F 1F680-1F6C5
1F700-1F773 E0001 E0020-E007F E0100-E01EF
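
The trouble with code points like these comes from JavaScript strings being sequences of 16-bit UTF-16 code units, so each of them occupies two units (a surrogate pair). A minimal sketch of the effect, assuming an ES2015+ engine where the `u` regex flag and `\u{...}` escapes are available (they were not when this answer was written); U+1D400 is taken from the 1D400-1D454 range listed above.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A, written with an ES2015 code point escape.
const boldA = '\u{1D400}';

console.log(boldA.length);          // 2: stored as a surrogate pair

// Without `u`, `.` matches a single code unit, so one astral character
// does not fit between ^ and $.
console.log(/^.$/.test(boldA));     // false

// With `u`, the engine matches by code point.
console.log(/^.$/u.test(boldA));    // true

// With `u`, ranges above U+FFFF can be written directly in a character class.
const inListedRange = /^[\u{1D400}-\u{1D454}]$/u;
console.log(inListedRange.test(boldA)); // true
```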

The correct way to do all this is included below.

If you are going to be playing around with Unicode character properties, it is tantamount to hopeless to hardcode code-point numbers like this. What you really want is to be able to say something like:


[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]

However, Javascript regexes are still completely antemillennial in this regard, and are so far from complying with Unicode Technical Standard #18: Unicode Regular Expressions, even at its very most basic compliance level, level one:

Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.

Because even the most rudimentary compliance level for Unicode regular expressions is still far beyond Javascript’s capabilities, I strongly recommend running whatever Unicode-aware regexes you need on the server in some language that actually supports them.

However, in the event that this is not practical, a sanity-saving workaround is the Javascript XRegExp plugin, which provides a saner regex library that also allows for access to certain essential character properties such as you are attempting to use.

As of v2.0, the “XRegExp All” add-on supports all these:

XRegExp 2.0.0
Unicode Base 1.0.0
Unicode Categories 1.2.0
Unicode Scripts 1.2.0
Unicode Blocks 1.2.0
Unicode Properties 1.0.0
XRegExp.matchRecursive 0.2.0
XRegExp.build 0.1.0
Prototypes 1.0.0

Which means that once you have it loaded, you will be able to get at the properties you need this way:


XRegExp("[^\\p{Latin}\\p{Common}\\p{Inherited}]");
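
For readers on newer engines: ECMAScript 2018 added Unicode property escapes, so the same negated class can be written without a plugin, provided the `u` flag is set. The sketch below assumes such an engine; `hasOnlyLatin` is a hypothetical helper name, not part of XRegExp or of the question's code.

```javascript
// Matches any single character that is NOT Latin, Common, or Inherited.
// Requires the `u` flag; no `g` flag, so test() stays stateless.
const nonLatin = /[^\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]/u;

// Hypothetical helper: true when every character in the field is acceptable.
function hasOnlyLatin(value) {
  return !nonLatin.test(value);
}

console.log(hasOnlyLatin('Crème brûlée, déjà vu!')); // true
console.log(hasOnlyLatin('Привет'));                 // false (Cyrillic)
console.log(hasOnlyLatin('你好'));                    // false (Han)
```

Because the pattern looks for a single offending character anywhere in the string, it checks the whole field rather than only the first character.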

Please note very carefully that as of Unicode v6.2, any and all of the following code points and code-point ranges are deemed to have the Script=Latin character property:


0041-005A 
0061-007A 
00AA 
00BA 
00C0-00D6 
00D8-00F6 
00F8-02B8 
02E0-02E4 
1D00-1D25 
1D2C-1D5C 
1D62-1D65 
1D6B-1D77 
1D79-1DBE 
1E00-1EFF 
2071 
207F 
2090-209C 
212A-212B 
2132 
214E 
2160-2188 
2C60-2C7F 
A722-A787 
A78B-A78E 
A790-A793 
A7A0-A7AA 
A7F8-A7FF 
FB00-FB06 
FF21-FF3A 
FF41-FF5A

Whereas these are the code points that have the Script=Common character property:


0000-0040  
005B-0060  
007B-00A9  
00AB-00B9  
00BB-00BF  
00D7
00F7
02B9-02DF  
02E5-02E9  
02EC-02FF  
0374
037E
0385 
0387
0589
060C
061B
061F
0640
0660-0669  
06DD
0964-0965  
0E3F 
0FD5-0FD8  
10FB
16EB-16ED
1735-1736
1802-1803
1805
1CD3
1CE1
1CE9-1CEC
1CEE-1CF3
1CF5-1CF6
2000-200B
200E-2064
206A-2070  
2074-207E  
2080-208E  
20A0-20BA  
2100-2125
2127-2129
212C-2131  
2133-214D  
214F-215F  
2189
2190-23F3
2400-2426
2440-244A
2460-26FF
2701-27FF
2900-2B4C
2B50-2B59
2E00-2E3B
2FF0-2FFB  
3000-3004
3006
3008-3020
3030-3037  
303C-303F
309B-309C
30A0
30FB-30FC
3190-319F
31C0-31E3
3220-325F
327F-32CF
3358-33FF
4DC0-4DFF
A700-A721
A788-A78A
A830-A839
FD3E-FD3F  
FDFD
FE10-FE19  
FE30-FE52
FE54-FE66
FE68-FE6B  
FEFF
FF01-FF20  
FF3B-FF40
FF5B-FF65
FF70
FF9E-FF9F
FFE0-FFE6
FFE8-FFEE
FFF9-FFFD
10100-10102
10107-10133
10137-1013F
10190-1019B
101D0-101FC
1D000-1D0F5
1D100-1D126
1D129-1D166
1D16A-1D17A
1D183-1D184
1D18C-1D1A9
1D1AE-1D1DD
1D300-1D356
1D360-1D371
1D400-1D454
1D456-1D49C
1D49E-1D49F
1D4A2
1D4A5-1D4A6
1D4A9-1D4AC
1D4AE-1D4B9
1D4BB
1D4BD-1D4C3
1D4C5-1D505
1D507-1D50A
1D50D-1D514
1D516-1D51C
1D51E-1D539
1D53B-1D53E
1D540-1D544
1D546
1D54A-1D550
1D552-1D6A5
1D6A8-1D7CB
1D7CE-1D7FF
1F000-1F02B
1F030-1F093
1F0A0-1F0AE
1F0B1-1F0BE
1F0C1-1F0CF
1F0D1-1F0DF
1F100-1F10A
1F110-1F12E
1F130-1F16B
1F170-1F19A
1F1E6-1F1FF
1F201-1F202
1F210-1F23A
1F240-1F248
1F250-1F251
1F300-1F320
1F330-1F335
1F337-1F37C
1F380-1F393
1F3A0-1F3C4
1F3C6-1F3CA
1F3E0-1F3F0
1F400-1F43E
1F440
1F442-1F4F7
1F4F9-1F4FC
1F500-1F53D
1F540-1F543
1F550-1F567
1F5FB-1F640
1F645-1F64F
1F680-1F6C5
1F700-1F773
E0001
E0020-E007F

And these are the code points that have the Script=Inherited character property:


0300-036F
0485-0486
064B-0655
0670
0951-0952
1CD0-1CD2
1CD4-1CE0
1CE2-1CE8
1CED
1CF4
1DC0-1DE6
1DFC-1DFF
200C-200D
20D0-20F0
302A-302D
3099-309A
FE00-FE0F
FE20-FE26
101FD
1D167-1D169
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
E0100-E01EF
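
If literal ranges truly cannot be avoided, generating the pattern from a range list is at least less error-prone than typing it by hand. This is only a sketch under assumptions: it needs an ES2015+ engine for the `u` flag, `rangesToPattern` is a hypothetical helper, and the sample list is just the first few Script=Latin entries from above, not the full set.

```javascript
// Build an anchored, code-point-aware pattern from "LO-HI" or "CP" hex entries.
// Hypothetical helper for illustration only.
function rangesToPattern(ranges) {
  const body = ranges
    .map((entry) => {
      const [lo, hi] = entry.trim().split('-');
      return hi ? `\\u{${lo}}-\\u{${hi}}` : `\\u{${lo}}`;
    })
    .join('');
  // The `u` flag enables \u{...} escapes, including code points above U+FFFF.
  return new RegExp(`^[${body}]*$`, 'u');
}

// First few Script=Latin entries only; deliberately incomplete.
const someLatin = rangesToPattern([
  '0041-005A', '0061-007A', '00AA', '00BA', '00C0-00D6', '00D8-00F6', '00F8-02B8',
]);

console.log(someLatin.test('Zürich')); // true
console.log(someLatin.test('a b'));    // false: the space is Script=Common, not listed here

// An astral range from the Common list works the same way.
const mathAlnum = rangesToPattern(['1D400-1D454']);
console.log(mathAlnum.test('\u{1D400}')); // true
```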

I hope the terrible maintenance, upkeep, legibility, and indeed writability problems that come of using literal code-point numbers like these make it clear that you want to at a bare minimum use the XRegExp add-ons.

143

answered Sep 22 '22 00:09

tchrist

I'm using:


/^[A-z\u00C0-\u00ff\s'\.,-\/#!$%\^&\*;:{}=\-_`~()]+$/

as regular expression. I didn't test it with all the options but I've been using this for years and I never had any issue.


var regexp = /[A-z\u00C0-\u00ff]+/g,
  ascii = ' hello !@#$%^&*())_+=',
  latin = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏàáâãäåæçèéêëìíîïÐÑÒÓÔÕÖØÙÚÛÜÝÞßðñòóôõöøùúûüýþÿ',
  chinese = ' 你 好 ';

console.log(regexp.test(ascii)); // true
console.log(regexp.test(latin)); // true
console.log(regexp.test(chinese)); // false

Gist: https://gist.github.com/germanattanasio/84cd25395688b7935182
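
Two details of the patterns in this answer are worth checking before relying on them. The range `A-z` also covers the six ASCII characters that sit between `Z` and `a`, and the demo pattern is unanchored, so it reports whether some Latin letter occurs rather than whether every character is acceptable. A small sketch; `strictLatin1` is a hypothetical stricter variant, not the answerer's pattern.

```javascript
// Same character class as the demo above, without the `g` flag.
const loose = /[A-z\u00C0-\u00ff]+/;

console.log(loose.test('^'));       // true: '^' lies between 'Z' and 'a' in ASCII
console.log(loose.test('你好 ok'));  // true: unanchored, so "ok" alone satisfies it

// Hypothetical stricter variant: explicit letter ranges, anchored to the whole string.
// Note that \u00C0-\u00FF still includes U+00D7 and U+00F7, which are not letters.
const strictLatin1 = /^[A-Za-z\u00C0-\u00FF\s]+$/;

console.log(strictLatin1.test('São Paulo')); // true
console.log(strictLatin1.test('你好 ok'));    // false
console.log(strictLatin1.test('^'));         // false
```

The `g` flag in the demo also makes `test` stateful through `lastIndex`, so repeated calls on the same regex object can begin searching partway into the next string.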

answered Sep 20 '22 00:09

German Attanasio
