
What is the motivation for Rust's ToLowercase?

Tags:

rust

Rust's char has a to_lowercase function which seems to return the struct ToLowercase which seems to be an iterator with always one element.

Wouldn't returning a char directly be far more natural and simple?

Ronald Smith asked Mar 01 '16 06:03



2 Answers

Wouldn't returning a char directly be far more natural and simple?

Natural, simple, and wrong. Unicode is too complicated for that to work. The fundamental issue is that a char is not sufficient to always represent a single, logically complete "character", for some definitions of "character".
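
To make the "a char is not a character" point concrete, here is a minimal sketch. The helper name `char_count` is made up for this illustration; everything else is plain standard library. The same visible "é" can be one `char` or two, depending on how it is encoded:

```rust
/// Number of `char`s (Unicode scalar values) in a string.
/// `char_count` is a hypothetical helper name, used only for this illustration.
fn char_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "é" as one precomposed scalar value, U+00E9.
    let precomposed = "\u{00E9}";
    // The same visible character as 'e' plus U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "e\u{0301}";

    assert_eq!(char_count(precomposed), 1);
    assert_eq!(char_count(decomposed), 2);

    // Both render as "é", but they are different sequences of `char`s.
    assert_ne!(precomposed, decomposed);

    println!("{} has {} char(s)", precomposed, char_count(precomposed));
    println!("{} has {} char(s)", decomposed, char_count(decomposed));
}
```

If one user-perceived character can already need several `char`s, a case mapping that returns exactly one `char` cannot cover every case.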

which seems to be an iterator with always one element.

This can be trivially demonstrated to be wrong by running a simple program that upper- and lower-cases every valid Unicode codepoint. The program:

/*!
Add the following to a `Cargo.toml` file:

```cargo
[dependencies]
arrayvec = "0.3.15"
```
*/
extern crate arrayvec;
use arrayvec::ArrayVec;

fn main() {
    let mut expanded_lcs = 0;
    let mut expanded_ucs = 0;

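    // Every Unicode scalar value: U+0000 to U+D7FF and U+E000 to U+10FFFF.
    // The ranges are half-open, so each upper bound is one past the last value.
    let usvs = (0..0xd800).chain(0xe000..0x110000)
        .flat_map(|v| std::char::from_u32(v).into_iter());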

    for c in usvs {
        let lc: ArrayVec<[_; 4]> = c.to_lowercase().collect();
        let uc: ArrayVec<[_; 4]> = c.to_uppercase().collect();

        if lc.len() != 1 {
            expanded_lcs += 1;
            print!("'{}' U+{:04X} L -> ", c, c as u32);
            for c in lc {
                print!("'{}' U+{:04X} ", c, c as u32);
            }
            println!("");
        }

        if uc.len() != 1 {
            expanded_ucs += 1;
            print!("'{}' U+{:04X} U -> ", c, c as u32);
            for c in uc {
                print!("'{}' U+{:04X} ", c, c as u32);
            }
            println!("");
        }
    }

    println!("\n-----\n");

    println!("Found {} chars with expanded lowercase conversions.", expanded_lcs);
    println!("Found {} chars with expanded uppercase conversions.", expanded_ucs);
}

And its output, given a rustc 1.8 nightly:

'ß' U+00DF U -> 'S' U+0053 'S' U+0053 
'İ' U+0130 L -> 'i' U+0069 '̇' U+0307 
'ŉ' U+0149 U -> 'ʼ' U+02BC 'N' U+004E 
'ǰ' U+01F0 U -> 'J' U+004A '̌' U+030C 
'ΐ' U+0390 U -> 'Ι' U+0399 '̈' U+0308 '́' U+0301 
'ΰ' U+03B0 U -> 'Υ' U+03A5 '̈' U+0308 '́' U+0301 
'և' U+0587 U -> 'Ե' U+0535 'Ւ' U+0552 
'ẖ' U+1E96 U -> 'H' U+0048 '̱' U+0331 
'ẗ' U+1E97 U -> 'T' U+0054 '̈' U+0308 
'ẘ' U+1E98 U -> 'W' U+0057 '̊' U+030A 
'ẙ' U+1E99 U -> 'Y' U+0059 '̊' U+030A 
'ẚ' U+1E9A U -> 'A' U+0041 'ʾ' U+02BE 
'ὐ' U+1F50 U -> 'Υ' U+03A5 '̓' U+0313 
'ὒ' U+1F52 U -> 'Υ' U+03A5 '̓' U+0313 '̀' U+0300 
'ὔ' U+1F54 U -> 'Υ' U+03A5 '̓' U+0313 '́' U+0301 
'ὖ' U+1F56 U -> 'Υ' U+03A5 '̓' U+0313 '͂' U+0342 
'ᾀ' U+1F80 U -> 'Ἀ' U+1F08 'Ι' U+0399 
'ᾁ' U+1F81 U -> 'Ἁ' U+1F09 'Ι' U+0399 
'ᾂ' U+1F82 U -> 'Ἂ' U+1F0A 'Ι' U+0399 
'ᾃ' U+1F83 U -> 'Ἃ' U+1F0B 'Ι' U+0399 
'ᾄ' U+1F84 U -> 'Ἄ' U+1F0C 'Ι' U+0399 
'ᾅ' U+1F85 U -> 'Ἅ' U+1F0D 'Ι' U+0399 
'ᾆ' U+1F86 U -> 'Ἆ' U+1F0E 'Ι' U+0399 
'ᾇ' U+1F87 U -> 'Ἇ' U+1F0F 'Ι' U+0399 
'ᾈ' U+1F88 U -> 'Ἀ' U+1F08 'Ι' U+0399 
'ᾉ' U+1F89 U -> 'Ἁ' U+1F09 'Ι' U+0399 
'ᾊ' U+1F8A U -> 'Ἂ' U+1F0A 'Ι' U+0399 
'ᾋ' U+1F8B U -> 'Ἃ' U+1F0B 'Ι' U+0399 
'ᾌ' U+1F8C U -> 'Ἄ' U+1F0C 'Ι' U+0399 
'ᾍ' U+1F8D U -> 'Ἅ' U+1F0D 'Ι' U+0399 
'ᾎ' U+1F8E U -> 'Ἆ' U+1F0E 'Ι' U+0399 
'ᾏ' U+1F8F U -> 'Ἇ' U+1F0F 'Ι' U+0399 
'ᾐ' U+1F90 U -> 'Ἠ' U+1F28 'Ι' U+0399 
'ᾑ' U+1F91 U -> 'Ἡ' U+1F29 'Ι' U+0399 
'ᾒ' U+1F92 U -> 'Ἢ' U+1F2A 'Ι' U+0399 
'ᾓ' U+1F93 U -> 'Ἣ' U+1F2B 'Ι' U+0399 
'ᾔ' U+1F94 U -> 'Ἤ' U+1F2C 'Ι' U+0399 
'ᾕ' U+1F95 U -> 'Ἥ' U+1F2D 'Ι' U+0399 
'ᾖ' U+1F96 U -> 'Ἦ' U+1F2E 'Ι' U+0399 
'ᾗ' U+1F97 U -> 'Ἧ' U+1F2F 'Ι' U+0399 
'ᾘ' U+1F98 U -> 'Ἠ' U+1F28 'Ι' U+0399 
'ᾙ' U+1F99 U -> 'Ἡ' U+1F29 'Ι' U+0399 
'ᾚ' U+1F9A U -> 'Ἢ' U+1F2A 'Ι' U+0399 
'ᾛ' U+1F9B U -> 'Ἣ' U+1F2B 'Ι' U+0399 
'ᾜ' U+1F9C U -> 'Ἤ' U+1F2C 'Ι' U+0399 
'ᾝ' U+1F9D U -> 'Ἥ' U+1F2D 'Ι' U+0399 
'ᾞ' U+1F9E U -> 'Ἦ' U+1F2E 'Ι' U+0399 
'ᾟ' U+1F9F U -> 'Ἧ' U+1F2F 'Ι' U+0399 
'ᾠ' U+1FA0 U -> 'Ὠ' U+1F68 'Ι' U+0399 
'ᾡ' U+1FA1 U -> 'Ὡ' U+1F69 'Ι' U+0399 
'ᾢ' U+1FA2 U -> 'Ὢ' U+1F6A 'Ι' U+0399 
'ᾣ' U+1FA3 U -> 'Ὣ' U+1F6B 'Ι' U+0399 
'ᾤ' U+1FA4 U -> 'Ὤ' U+1F6C 'Ι' U+0399 
'ᾥ' U+1FA5 U -> 'Ὥ' U+1F6D 'Ι' U+0399 
'ᾦ' U+1FA6 U -> 'Ὦ' U+1F6E 'Ι' U+0399 
'ᾧ' U+1FA7 U -> 'Ὧ' U+1F6F 'Ι' U+0399 
'ᾨ' U+1FA8 U -> 'Ὠ' U+1F68 'Ι' U+0399 
'ᾩ' U+1FA9 U -> 'Ὡ' U+1F69 'Ι' U+0399 
'ᾪ' U+1FAA U -> 'Ὢ' U+1F6A 'Ι' U+0399 
'ᾫ' U+1FAB U -> 'Ὣ' U+1F6B 'Ι' U+0399 
'ᾬ' U+1FAC U -> 'Ὤ' U+1F6C 'Ι' U+0399 
'ᾭ' U+1FAD U -> 'Ὥ' U+1F6D 'Ι' U+0399 
'ᾮ' U+1FAE U -> 'Ὦ' U+1F6E 'Ι' U+0399 
'ᾯ' U+1FAF U -> 'Ὧ' U+1F6F 'Ι' U+0399 
'ᾲ' U+1FB2 U -> 'Ὰ' U+1FBA 'Ι' U+0399 
'ᾳ' U+1FB3 U -> 'Α' U+0391 'Ι' U+0399 
'ᾴ' U+1FB4 U -> 'Ά' U+0386 'Ι' U+0399 
'ᾶ' U+1FB6 U -> 'Α' U+0391 '͂' U+0342 
'ᾷ' U+1FB7 U -> 'Α' U+0391 '͂' U+0342 'Ι' U+0399 
'ᾼ' U+1FBC U -> 'Α' U+0391 'Ι' U+0399 
'ῂ' U+1FC2 U -> 'Ὴ' U+1FCA 'Ι' U+0399 
'ῃ' U+1FC3 U -> 'Η' U+0397 'Ι' U+0399 
'ῄ' U+1FC4 U -> 'Ή' U+0389 'Ι' U+0399 
'ῆ' U+1FC6 U -> 'Η' U+0397 '͂' U+0342 
'ῇ' U+1FC7 U -> 'Η' U+0397 '͂' U+0342 'Ι' U+0399 
'ῌ' U+1FCC U -> 'Η' U+0397 'Ι' U+0399 
'ῒ' U+1FD2 U -> 'Ι' U+0399 '̈' U+0308 '̀' U+0300 
'ΐ' U+1FD3 U -> 'Ι' U+0399 '̈' U+0308 '́' U+0301 
'ῖ' U+1FD6 U -> 'Ι' U+0399 '͂' U+0342 
'ῗ' U+1FD7 U -> 'Ι' U+0399 '̈' U+0308 '͂' U+0342 
'ῢ' U+1FE2 U -> 'Υ' U+03A5 '̈' U+0308 '̀' U+0300 
'ΰ' U+1FE3 U -> 'Υ' U+03A5 '̈' U+0308 '́' U+0301 
'ῤ' U+1FE4 U -> 'Ρ' U+03A1 '̓' U+0313 
'ῦ' U+1FE6 U -> 'Υ' U+03A5 '͂' U+0342 
'ῧ' U+1FE7 U -> 'Υ' U+03A5 '̈' U+0308 '͂' U+0342 
'ῲ' U+1FF2 U -> 'Ὼ' U+1FFA 'Ι' U+0399 
'ῳ' U+1FF3 U -> 'Ω' U+03A9 'Ι' U+0399 
'ῴ' U+1FF4 U -> 'Ώ' U+038F 'Ι' U+0399 
'ῶ' U+1FF6 U -> 'Ω' U+03A9 '͂' U+0342 
'ῷ' U+1FF7 U -> 'Ω' U+03A9 '͂' U+0342 'Ι' U+0399 
'ῼ' U+1FFC U -> 'Ω' U+03A9 'Ι' U+0399 
'ﬀ' U+FB00 U -> 'F' U+0046 'F' U+0046 
'ﬁ' U+FB01 U -> 'F' U+0046 'I' U+0049 
'ﬂ' U+FB02 U -> 'F' U+0046 'L' U+004C 
'ﬃ' U+FB03 U -> 'F' U+0046 'F' U+0046 'I' U+0049 
'ﬄ' U+FB04 U -> 'F' U+0046 'F' U+0046 'L' U+004C 
'ﬅ' U+FB05 U -> 'S' U+0053 'T' U+0054 
'ﬆ' U+FB06 U -> 'S' U+0053 'T' U+0054 
'ﬓ' U+FB13 U -> 'Մ' U+0544 'Ն' U+0546 
'ﬔ' U+FB14 U -> 'Մ' U+0544 'Ե' U+0535 
'ﬕ' U+FB15 U -> 'Մ' U+0544 'Ի' U+053B 
'ﬖ' U+FB16 U -> 'Վ' U+054E 'Ն' U+0546 
'ﬗ' U+FB17 U -> 'Մ' U+0544 'Խ' U+053D 

-----

Found 1 chars with expanded lowercase conversions.
Found 102 chars with expanded uppercase conversions.

Note that this does not take locale into account, which could change the output.
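
As a rough illustration of the locale point, here is a sketch contrasting the standard library's locale-independent mapping with a hand-rolled Turkish-style one. Both `lowercase_default` and `lowercase_tr` are hypothetical helpers written for this example, not standard-library functions, and `lowercase_tr` is deliberately simplified: it only special-cases the dotted and dotless I and ignores the context-dependent rules a real tailoring would need.

```rust
/// Locale-independent lowercase mapping, as provided by the standard library.
/// (Hypothetical wrapper, for illustration only.)
fn lowercase_default(c: char) -> String {
    c.to_lowercase().collect()
}

/// Simplified Turkish-style lowercase mapping (an assumption for illustration).
/// Only the two I-related cases differ from the default mapping here.
fn lowercase_tr(c: char) -> String {
    match c {
        // LATIN CAPITAL LETTER I -> LATIN SMALL LETTER DOTLESS I
        'I' => "\u{0131}".to_string(),
        // LATIN CAPITAL LETTER I WITH DOT ABOVE -> plain 'i'
        '\u{0130}' => "i".to_string(),
        // Everything else falls back to the default mapping.
        _ => lowercase_default(c),
    }
}

fn main() {
    // Default: 'I' becomes 'i', and U+0130 becomes 'i' followed by U+0307.
    assert_eq!(lowercase_default('I'), "i");
    assert_eq!(lowercase_default('\u{0130}'), "i\u{0307}");

    // Turkish-style sketch: 'I' becomes dotless 'ı', U+0130 becomes plain 'i'.
    assert_eq!(lowercase_tr('I'), "\u{0131}");
    assert_eq!(lowercase_tr('\u{0130}'), "i");

    println!("default: {}  tr-style: {}", lowercase_default('I'), lowercase_tr('I'));
}
```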

DK. answered Oct 19 '22 07:10


which seems to be an iterator with always one element.

Not always. In some cases a single character represents the lowercase symbol, whereas the uppercase form is represented by two or more characters.

Those cases are covered in Unicode's SpecialCasing.txt. Quote from the Rust documentation:

This performs complex unconditional mappings with no tailoring: it maps one Unicode character to its lowercase equivalent according to the Unicode database and the additional complex mappings SpecialCasing.txt.
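
In practice the iterator is rarely awkward to use. A small sketch, where `upper_string` is a hypothetical helper name and the rest is standard library: collect the result into a `String`, format it directly (the returned iterator implements `Display`), or case-map a whole `&str` at once.

```rust
/// Uppercase a single `char` into a `String`, however many chars it expands to.
/// (Hypothetical helper, for illustration only.)
fn upper_string(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // One char in, two chars out.
    assert_eq!(upper_string('ß'), "SS");

    // The iterator also implements `Display`, so it can be formatted directly.
    assert_eq!('ß'.to_uppercase().to_string(), "SS");

    // For whole strings, `str::to_uppercase` applies the same expansions.
    assert_eq!("straße".to_uppercase(), "STRASSE");

    // When a mapping is known to be one-to-one, `next()` yields the single char.
    assert_eq!('A'.to_lowercase().next(), Some('a'));

    println!("{}", upper_string('ß'));
}
```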

awesoon answered Oct 19 '22 07:10