Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python .lower does not seem to properly lowercase all unicode characters (Python 2.7)

I'm trying to normalize latin unicode characters to their lower case forms. I thought I could use the .lower function on a unicode object, but it seems that some parts of unicode are not covered - in particular this block has some characters that do not lowercase: 0xa7a0, 0xa7a2, 0xa7b6, 0xa79e, 0xa79a, 0xa790 all return the same character when .lower is called. I don't anticipate seeing a lot of these characters, so it would be a waste of time to go through all the relevant blocks to fix the ones that don't convert correctly. Is there a more complete function to lowercase unicode, or is this a locale issue?

like image 675
faiuwle Avatar asked Mar 15 '16 17:03

faiuwle


1 Answers

The latest version of Python 2 (2.7.11) uses version 5.2.0 of the Unicode Character Database, whereas the latest Python 3 (3.5.1) uses version 8.0.0. Earlier versions of Python 3 all use UCD version 6.x.x.

A diff of the tables between 5.2.0 and 8.0.0, shows the following list of missing characters in the range (A720-A7FF) you are interested in:

-A78D;LATIN CAPITAL LETTER TURNED H;Lu;0;L;;;;;N;;;;0265;
-A78E;LATIN SMALL LETTER L WITH RETROFLEX HOOK AND BELT;Ll;0;L;;;;;N;;;;;
-A78F;LATIN LETTER SINOLOGICAL DOT;Lo;0;L;;;;;N;;;;;
-A790;LATIN CAPITAL LETTER N WITH DESCENDER;Lu;0;L;;;;;N;;;;A791;
-A791;LATIN SMALL LETTER N WITH DESCENDER;Ll;0;L;;;;;N;;;A790;;A790
-A792;LATIN CAPITAL LETTER C WITH BAR;Lu;0;L;;;;;N;;;;A793;
-A793;LATIN SMALL LETTER C WITH BAR;Ll;0;L;;;;;N;;;A792;;A792
-A794;LATIN SMALL LETTER C WITH PALATAL HOOK;Ll;0;L;;;;;N;;;;;
-A795;LATIN SMALL LETTER H WITH PALATAL HOOK;Ll;0;L;;;;;N;;;;;
-A796;LATIN CAPITAL LETTER B WITH FLOURISH;Lu;0;L;;;;;N;;;;A797;
-A797;LATIN SMALL LETTER B WITH FLOURISH;Ll;0;L;;;;;N;;;A796;;A796
-A798;LATIN CAPITAL LETTER F WITH STROKE;Lu;0;L;;;;;N;;;;A799;
-A799;LATIN SMALL LETTER F WITH STROKE;Ll;0;L;;;;;N;;;A798;;A798
-A79A;LATIN CAPITAL LETTER VOLAPUK AE;Lu;0;L;;;;;N;;;;A79B;
-A79B;LATIN SMALL LETTER VOLAPUK AE;Ll;0;L;;;;;N;;;A79A;;A79A
-A79C;LATIN CAPITAL LETTER VOLAPUK OE;Lu;0;L;;;;;N;;;;A79D;
-A79D;LATIN SMALL LETTER VOLAPUK OE;Ll;0;L;;;;;N;;;A79C;;A79C
-A79E;LATIN CAPITAL LETTER VOLAPUK UE;Lu;0;L;;;;;N;;;;A79F;
-A79F;LATIN SMALL LETTER VOLAPUK UE;Ll;0;L;;;;;N;;;A79E;;A79E
-A7A0;LATIN CAPITAL LETTER G WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A1;
-A7A1;LATIN SMALL LETTER G WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A0;;A7A0
-A7A2;LATIN CAPITAL LETTER K WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A3;
-A7A3;LATIN SMALL LETTER K WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A2;;A7A2
-A7A4;LATIN CAPITAL LETTER N WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A5;
-A7A5;LATIN SMALL LETTER N WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A4;;A7A4
-A7A6;LATIN CAPITAL LETTER R WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A7;
-A7A7;LATIN SMALL LETTER R WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A6;;A7A6
-A7A8;LATIN CAPITAL LETTER S WITH OBLIQUE STROKE;Lu;0;L;;;;;N;;;;A7A9;
-A7A9;LATIN SMALL LETTER S WITH OBLIQUE STROKE;Ll;0;L;;;;;N;;;A7A8;;A7A8
-A7AA;LATIN CAPITAL LETTER H WITH HOOK;Lu;0;L;;;;;N;;;;0266;
-A7AB;LATIN CAPITAL LETTER REVERSED OPEN E;Lu;0;L;;;;;N;;;;025C;
-A7AC;LATIN CAPITAL LETTER SCRIPT G;Lu;0;L;;;;;N;;;;0261;
-A7AD;LATIN CAPITAL LETTER L WITH BELT;Lu;0;L;;;;;N;;;;026C;
-A7B0;LATIN CAPITAL LETTER TURNED K;Lu;0;L;;;;;N;;;;029E;
-A7B1;LATIN CAPITAL LETTER TURNED T;Lu;0;L;;;;;N;;;;0287;
-A7B2;LATIN CAPITAL LETTER J WITH CROSSED-TAIL;Lu;0;L;;;;;N;;;;029D;
-A7B3;LATIN CAPITAL LETTER CHI;Lu;0;L;;;;;N;;;;AB53;
-A7B4;LATIN CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;A7B5;
-A7B5;LATIN SMALL LETTER BETA;Ll;0;L;;;;;N;;;A7B4;;A7B4
-A7B6;LATIN CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;A7B7;
-A7B7;LATIN SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;A7B6;;A7B6
-A7F7;LATIN EPIGRAPHIC LETTER SIDEWAYS I;Lo;0;L;;;;;N;;;;;
-A7F8;MODIFIER LETTER CAPITAL H WITH STROKE;Lm;0;L;<super> 0126;;;;N;;;;;
-A7F9;MODIFIER LETTER SMALL LIGATURE OE;Lm;0;L;<super> 0153;;;;N;;;;;
-A7FA;LATIN LETTER SMALL CAPITAL TURNED M;Ll;0;L;;;;;N;;;;;

Which includes all of the characters you mention in your question, and obviously many more.

Unfortunately, there is nothing you can do to make python's lower/upper use a newer version of the UCD - although you can get an updated version of the unicodedata module by installing unicodedata2. But that doesn't really help you much.

It looks like your only real choices are to roll your own lower/upper functions, or switch to the latest version of Python 3.

like image 93
ekhumoro Avatar answered Nov 04 '22 14:11

ekhumoro