 

How can I case fold a string in Rust?

Tags: unicode, rust

I'm writing a simple full-text search library and need case folding to check whether two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.
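To illustrate why lowercasing or uppercasing alone falls short, here is a minimal sketch using only the standard library. The helper names `eq_lowercase` and `eq_uppercase` are made up for this example. Under Unicode full case folding, all three spellings below would fold to "strasse", but neither naive comparison catches every pair:

```rust
/// Hypothetical helper: compare two strings after lowercasing both.
fn eq_lowercase(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

/// Hypothetical helper: compare two strings after uppercasing both.
fn eq_uppercase(a: &str, b: &str) -> bool {
    a.to_uppercase() == b.to_uppercase()
}

fn main() {
    // Lowercasing leaves "ß" alone, so "straße" != "strasse".
    assert!(!eq_lowercase("Straße", "STRASSE"));

    // Uppercasing turns "ß" into "SS", so this pair happens to match.
    assert!(eq_uppercase("Straße", "STRASSE"));

    // "ẞ" (U+1E9E, capital sharp s) lowercases to "ß" and uppercases
    // to itself, so both naive comparisons miss this pair.
    assert!(!eq_lowercase("STRAẞE", "STRASSE"));
    assert!(!eq_uppercase("STRAẞE", "STRASSE"));
}
```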

From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.

asked Oct 25 '16 22:10 by Lambda Fairy



1 Answer

For my use case, I've found the caseless crate to be most useful.

As far as I know, this is the only library which supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.

Here's some example code that matches a string case-insensitively:

extern crate caseless;

fn main() {
    let a = "100 ㎒";
    let b = "100 mhz";

    // These strings don't match with just case folding,
    // but do match after compatibility (NFKD) normalization
    assert!(!caseless::default_caseless_match_str(a, b));
    assert!(caseless::compatibility_caseless_match_str(a, b));
}

To get the case folded string directly, you can use the default_case_fold_str function:

let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");

Caseless doesn't expose a corresponding function that normalizes as well, but you can write one using the unicode-normalization crate:

extern crate caseless;
extern crate unicode_normalization;
use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

// Applies NFD, case fold, NFKD, case fold, NFKD, in that order
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

fn main() {
    let a = "100 ㎒";
    assert_eq!(compatibility_case_fold(a), "100 mhz");
}

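Note that multiple rounds of normalization and case folding are needed for a correct result: the Unicode Standard defines compatibility caseless matching by comparing NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) for both strings, which is the chain implemented above.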
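For the full-text search use case in the question, one way to use such a function is to fold every word once at indexing time and once at query time, then compare the folded keys. The sketch below is only an illustration: the `Index` type is hypothetical, and it uses `str::to_lowercase` as a standard-library stand-in so it runs without extra crates. In real code you would pass a proper folding function with the same `&str -> String` shape, such as `compatibility_case_fold` above.

```rust
use std::collections::HashMap;

/// Hypothetical index: maps a folded word to the ids of documents containing it.
struct Index<F: Fn(&str) -> String> {
    fold: F,
    postings: HashMap<String, Vec<usize>>,
}

impl<F: Fn(&str) -> String> Index<F> {
    fn new(fold: F) -> Self {
        Index { fold, postings: HashMap::new() }
    }

    /// Folds each whitespace-separated word and records the document id.
    /// Assumes each document is added in a single call.
    fn add(&mut self, doc_id: usize, text: &str) {
        for word in text.split_whitespace() {
            let key = (self.fold)(word);
            let ids = self.postings.entry(key).or_insert_with(Vec::new);
            // Skip the push if this document was just recorded for this word.
            if ids.last() != Some(&doc_id) {
                ids.push(doc_id);
            }
        }
    }

    /// Folds the query word the same way before looking it up.
    fn lookup(&self, word: &str) -> &[usize] {
        self.postings
            .get(&(self.fold)(word))
            .map(|ids| ids.as_slice())
            .unwrap_or(&[])
    }
}

fn main() {
    // Stand-in fold function; swap in a real case folding function here.
    let mut index = Index::new(|w: &str| w.to_lowercase());
    index.add(1, "Twilight Sparkle");
    index.add(2, "twilight zone");

    assert_eq!(index.lookup("TWILIGHT").to_vec(), vec![1, 2]);
    assert_eq!(index.lookup("sparkle").to_vec(), vec![1]);
    assert!(index.lookup("rarity").is_empty());
}
```

Taking the fold function as a parameter keeps the index independent of which folding strategy you settle on, so the choice between default and compatibility folding stays in one place.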

(Thanks to BurntSushi5 for pointing me to this library.)

answered Oct 04 '22 06:10 by Lambda Fairy
