Rust's char has a to_lowercase function which seems to return the struct ToLowercase, which seems to be an iterator with always one element. Wouldn't returning a char directly be far more natural and simple?
Wouldn't returning a char directly be far more natural and simple?
Natural, simple, and wrong. Unicode is too complicated for that to work. The fundamental issue is that a char is not always sufficient to represent a single, logically complete "character", for some definitions of "character".
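As a quick illustration of that point, here is a minimal sketch using only the standard library. The helper names `upper` and `lower` are made up for this example; they just collect whatever the iterator yields into a String.

```rust
// Hypothetical helper: collect the full uppercase mapping of one char.
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

// Hypothetical helper: collect the full lowercase mapping of one char.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // The common case: exactly one char comes out.
    assert_eq!(lower('A'), "a");
    // U+00DF expands to two chars when uppercased.
    assert_eq!(upper('ß'), "SS");
    // U+0130 lowercases to 'i' followed by U+0307 COMBINING DOT ABOVE.
    assert_eq!(lower('İ'), "i\u{307}");
    // U+FB03 (the "ffi" ligature) uppercases to three chars.
    assert_eq!(upper('\u{FB03}').chars().count(), 3);
    println!("ß -> {}", upper('ß'));
}
```

A function returning a single char would have nowhere to put the second and third chars in these cases.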
which seems to be an iterator with always one element.
This can be trivially demonstrated to be wrong by running a simple program that upper- and lower-cases every valid Unicode codepoint. The program:
/*!
Add the following to a `Cargo.toml` file:
```cargo
[dependencies]
arrayvec = "0.3.15"
```
*/
extern crate arrayvec;
use arrayvec::ArrayVec;

fn main() {
    let mut expanded_lcs = 0;
    let mut expanded_ucs = 0;

    // Every Unicode scalar value: U+0000..=U+D7FF and U+E000..=U+10FFFF.
    // The ranges are half-open, so the upper bounds are one past the end.
    let usvs = (0..0xd800).chain(0xe000..0x110000)
        .flat_map(|v| std::char::from_u32(v).into_iter());

    for c in usvs {
        // A capacity of 4 is assumed to be enough for any single mapping;
        // the longest expansion observed below is 3 chars.
        let lc: ArrayVec<[_; 4]> = c.to_lowercase().collect();
        let uc: ArrayVec<[_; 4]> = c.to_uppercase().collect();

        // Report every char whose lowercase mapping is not exactly one char.
        if lc.len() != 1 {
            expanded_lcs += 1;
            print!("'{}' U+{:04X} L -> ", c, c as u32);
            for c in lc {
                print!("'{}' U+{:04X} ", c, c as u32);
            }
            println!("");
        }

        // Same check for the uppercase mapping.
        if uc.len() != 1 {
            expanded_ucs += 1;
            print!("'{}' U+{:04X} U -> ", c, c as u32);
            for c in uc {
                print!("'{}' U+{:04X} ", c, c as u32);
            }
            println!("");
        }
    }

    println!("\n-----\n");
    println!("Found {} chars with expanded lowercase conversions.", expanded_lcs);
    println!("Found {} chars with expanded uppercase conversions.", expanded_ucs);
}
And its output, given a rustc 1.8 nightly:
'ß' U+00DF U -> 'S' U+0053 'S' U+0053
'İ' U+0130 L -> 'i' U+0069 '̇' U+0307
'ʼn' U+0149 U -> 'ʼ' U+02BC 'N' U+004E
'ǰ' U+01F0 U -> 'J' U+004A '̌' U+030C
'ΐ' U+0390 U -> 'Ι' U+0399 '̈' U+0308 '́' U+0301
'ΰ' U+03B0 U -> 'Υ' U+03A5 '̈' U+0308 '́' U+0301
'և' U+0587 U -> 'Ե' U+0535 'Ւ' U+0552
'ẖ' U+1E96 U -> 'H' U+0048 '̱' U+0331
'ẗ' U+1E97 U -> 'T' U+0054 '̈' U+0308
'ẘ' U+1E98 U -> 'W' U+0057 '̊' U+030A
'ẙ' U+1E99 U -> 'Y' U+0059 '̊' U+030A
'ẚ' U+1E9A U -> 'A' U+0041 'ʾ' U+02BE
'ὐ' U+1F50 U -> 'Υ' U+03A5 '̓' U+0313
'ὒ' U+1F52 U -> 'Υ' U+03A5 '̓' U+0313 '̀' U+0300
'ὔ' U+1F54 U -> 'Υ' U+03A5 '̓' U+0313 '́' U+0301
'ὖ' U+1F56 U -> 'Υ' U+03A5 '̓' U+0313 '͂' U+0342
'ᾀ' U+1F80 U -> 'Ἀ' U+1F08 'Ι' U+0399
'ᾁ' U+1F81 U -> 'Ἁ' U+1F09 'Ι' U+0399
'ᾂ' U+1F82 U -> 'Ἂ' U+1F0A 'Ι' U+0399
'ᾃ' U+1F83 U -> 'Ἃ' U+1F0B 'Ι' U+0399
'ᾄ' U+1F84 U -> 'Ἄ' U+1F0C 'Ι' U+0399
'ᾅ' U+1F85 U -> 'Ἅ' U+1F0D 'Ι' U+0399
'ᾆ' U+1F86 U -> 'Ἆ' U+1F0E 'Ι' U+0399
'ᾇ' U+1F87 U -> 'Ἇ' U+1F0F 'Ι' U+0399
'ᾈ' U+1F88 U -> 'Ἀ' U+1F08 'Ι' U+0399
'ᾉ' U+1F89 U -> 'Ἁ' U+1F09 'Ι' U+0399
'ᾊ' U+1F8A U -> 'Ἂ' U+1F0A 'Ι' U+0399
'ᾋ' U+1F8B U -> 'Ἃ' U+1F0B 'Ι' U+0399
'ᾌ' U+1F8C U -> 'Ἄ' U+1F0C 'Ι' U+0399
'ᾍ' U+1F8D U -> 'Ἅ' U+1F0D 'Ι' U+0399
'ᾎ' U+1F8E U -> 'Ἆ' U+1F0E 'Ι' U+0399
'ᾏ' U+1F8F U -> 'Ἇ' U+1F0F 'Ι' U+0399
'ᾐ' U+1F90 U -> 'Ἠ' U+1F28 'Ι' U+0399
'ᾑ' U+1F91 U -> 'Ἡ' U+1F29 'Ι' U+0399
'ᾒ' U+1F92 U -> 'Ἢ' U+1F2A 'Ι' U+0399
'ᾓ' U+1F93 U -> 'Ἣ' U+1F2B 'Ι' U+0399
'ᾔ' U+1F94 U -> 'Ἤ' U+1F2C 'Ι' U+0399
'ᾕ' U+1F95 U -> 'Ἥ' U+1F2D 'Ι' U+0399
'ᾖ' U+1F96 U -> 'Ἦ' U+1F2E 'Ι' U+0399
'ᾗ' U+1F97 U -> 'Ἧ' U+1F2F 'Ι' U+0399
'ᾘ' U+1F98 U -> 'Ἠ' U+1F28 'Ι' U+0399
'ᾙ' U+1F99 U -> 'Ἡ' U+1F29 'Ι' U+0399
'ᾚ' U+1F9A U -> 'Ἢ' U+1F2A 'Ι' U+0399
'ᾛ' U+1F9B U -> 'Ἣ' U+1F2B 'Ι' U+0399
'ᾜ' U+1F9C U -> 'Ἤ' U+1F2C 'Ι' U+0399
'ᾝ' U+1F9D U -> 'Ἥ' U+1F2D 'Ι' U+0399
'ᾞ' U+1F9E U -> 'Ἦ' U+1F2E 'Ι' U+0399
'ᾟ' U+1F9F U -> 'Ἧ' U+1F2F 'Ι' U+0399
'ᾠ' U+1FA0 U -> 'Ὠ' U+1F68 'Ι' U+0399
'ᾡ' U+1FA1 U -> 'Ὡ' U+1F69 'Ι' U+0399
'ᾢ' U+1FA2 U -> 'Ὢ' U+1F6A 'Ι' U+0399
'ᾣ' U+1FA3 U -> 'Ὣ' U+1F6B 'Ι' U+0399
'ᾤ' U+1FA4 U -> 'Ὤ' U+1F6C 'Ι' U+0399
'ᾥ' U+1FA5 U -> 'Ὥ' U+1F6D 'Ι' U+0399
'ᾦ' U+1FA6 U -> 'Ὦ' U+1F6E 'Ι' U+0399
'ᾧ' U+1FA7 U -> 'Ὧ' U+1F6F 'Ι' U+0399
'ᾨ' U+1FA8 U -> 'Ὠ' U+1F68 'Ι' U+0399
'ᾩ' U+1FA9 U -> 'Ὡ' U+1F69 'Ι' U+0399
'ᾪ' U+1FAA U -> 'Ὢ' U+1F6A 'Ι' U+0399
'ᾫ' U+1FAB U -> 'Ὣ' U+1F6B 'Ι' U+0399
'ᾬ' U+1FAC U -> 'Ὤ' U+1F6C 'Ι' U+0399
'ᾭ' U+1FAD U -> 'Ὥ' U+1F6D 'Ι' U+0399
'ᾮ' U+1FAE U -> 'Ὦ' U+1F6E 'Ι' U+0399
'ᾯ' U+1FAF U -> 'Ὧ' U+1F6F 'Ι' U+0399
'ᾲ' U+1FB2 U -> 'Ὰ' U+1FBA 'Ι' U+0399
'ᾳ' U+1FB3 U -> 'Α' U+0391 'Ι' U+0399
'ᾴ' U+1FB4 U -> 'Ά' U+0386 'Ι' U+0399
'ᾶ' U+1FB6 U -> 'Α' U+0391 '͂' U+0342
'ᾷ' U+1FB7 U -> 'Α' U+0391 '͂' U+0342 'Ι' U+0399
'ᾼ' U+1FBC U -> 'Α' U+0391 'Ι' U+0399
'ῂ' U+1FC2 U -> 'Ὴ' U+1FCA 'Ι' U+0399
'ῃ' U+1FC3 U -> 'Η' U+0397 'Ι' U+0399
'ῄ' U+1FC4 U -> 'Ή' U+0389 'Ι' U+0399
'ῆ' U+1FC6 U -> 'Η' U+0397 '͂' U+0342
'ῇ' U+1FC7 U -> 'Η' U+0397 '͂' U+0342 'Ι' U+0399
'ῌ' U+1FCC U -> 'Η' U+0397 'Ι' U+0399
'ῒ' U+1FD2 U -> 'Ι' U+0399 '̈' U+0308 '̀' U+0300
'ΐ' U+1FD3 U -> 'Ι' U+0399 '̈' U+0308 '́' U+0301
'ῖ' U+1FD6 U -> 'Ι' U+0399 '͂' U+0342
'ῗ' U+1FD7 U -> 'Ι' U+0399 '̈' U+0308 '͂' U+0342
'ῢ' U+1FE2 U -> 'Υ' U+03A5 '̈' U+0308 '̀' U+0300
'ΰ' U+1FE3 U -> 'Υ' U+03A5 '̈' U+0308 '́' U+0301
'ῤ' U+1FE4 U -> 'Ρ' U+03A1 '̓' U+0313
'ῦ' U+1FE6 U -> 'Υ' U+03A5 '͂' U+0342
'ῧ' U+1FE7 U -> 'Υ' U+03A5 '̈' U+0308 '͂' U+0342
'ῲ' U+1FF2 U -> 'Ὼ' U+1FFA 'Ι' U+0399
'ῳ' U+1FF3 U -> 'Ω' U+03A9 'Ι' U+0399
'ῴ' U+1FF4 U -> 'Ώ' U+038F 'Ι' U+0399
'ῶ' U+1FF6 U -> 'Ω' U+03A9 '͂' U+0342
'ῷ' U+1FF7 U -> 'Ω' U+03A9 '͂' U+0342 'Ι' U+0399
'ῼ' U+1FFC U -> 'Ω' U+03A9 'Ι' U+0399
'ff' U+FB00 U -> 'F' U+0046 'F' U+0046
'fi' U+FB01 U -> 'F' U+0046 'I' U+0049
'fl' U+FB02 U -> 'F' U+0046 'L' U+004C
'ffi' U+FB03 U -> 'F' U+0046 'F' U+0046 'I' U+0049
'ffl' U+FB04 U -> 'F' U+0046 'F' U+0046 'L' U+004C
'ſt' U+FB05 U -> 'S' U+0053 'T' U+0054
'st' U+FB06 U -> 'S' U+0053 'T' U+0054
'ﬓ' U+FB13 U -> 'Մ' U+0544 'Ն' U+0546
'ﬔ' U+FB14 U -> 'Մ' U+0544 'Ե' U+0535
'ﬕ' U+FB15 U -> 'Մ' U+0544 'Ի' U+053B
'ﬖ' U+FB16 U -> 'Վ' U+054E 'Ն' U+0546
'ﬗ' U+FB17 U -> 'Մ' U+0544 'Խ' U+053D
-----
Found 1 chars with expanded lowercase conversions.
Found 102 chars with expanded uppercase conversions.
Note that this does not take locale into account, which could change the output.
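To make the locale point concrete, here is a small sketch, assuming only the standard library. In Turkish, the lowercase of 'I' is conventionally the dotless 'ı' (U+0131), but char::to_lowercase applies no such tailoring. The helper name `lower` is invented for the example.

```rust
// Hypothetical helper: lowercase one char with std's locale-independent mapping.
fn lower(c: char) -> String {
    c.to_lowercase().collect()
}

fn main() {
    // std always maps 'I' to plain 'i', whatever the user's locale is.
    assert_eq!(lower('I'), "i");
    // A Turkish-tailored mapping would give U+0131 (dotless i) instead;
    // std does not produce that.
    assert_ne!(lower('I'), "\u{131}");
    println!("I -> {}", lower('I'));
}
```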
which seems to be an iterator with always one element.
Not always. There are some cases where a single character represents the lowercase symbol, whereas the uppercase symbol is represented by two or more characters.
Those cases are covered in the SpecialCasing file of the Unicode documentation. Quote from the Rust documentation:
This performs complex unconditional mappings with no tailoring: it maps one Unicode character to its lowercase equivalent according to the Unicode database and the additional complex mappings SpecialCasing.txt.
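In practice the iterator is easy to work with. The following is a sketch, not an official recipe; `single_lowercase` and `uppercase_chars` are hypothetical helpers written for this answer. The first one gives you a plain char when the mapping really is one-to-one, and the second shows how the iterator composes with flat_map when converting a whole string.

```rust
// Hypothetical helper: return the lowercase char only if the mapping
// yields exactly one char, otherwise None.
fn single_lowercase(c: char) -> Option<char> {
    let mut it = c.to_lowercase();
    match (it.next(), it.next()) {
        (Some(l), None) => Some(l),
        _ => None,
    }
}

// Hypothetical helper: uppercase a string one char at a time,
// letting each char expand to as many chars as it needs.
fn uppercase_chars(s: &str) -> String {
    s.chars().flat_map(char::to_uppercase).collect()
}

fn main() {
    assert_eq!(single_lowercase('A'), Some('a'));
    // U+0130 lowercases to two chars, so there is no single-char answer.
    assert_eq!(single_lowercase('İ'), None);
    // The result is one char longer than the input because of 'ß'.
    assert_eq!(uppercase_chars("straße"), "STRASSE");
    println!("{}", uppercase_chars("straße"));
}
```

For whole strings, str::to_uppercase and str::to_lowercase already exist in the standard library; the per-char version here is only meant to show why the iterator return type is convenient.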