I'm writing a simple full text search library, and need case folding to check if two words are equal. For this use case, the existing .to_lowercase() and .to_uppercase() methods are not enough.
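To illustrate the kind of mismatch I mean, here is a minimal sketch that uses only the standard library. The helper name lowercase_eq is hypothetical and exists just for this example. It shows two pairs of words that full Unicode case folding treats as equal but that a lowercase-and-compare check does not.

```rust
/// Naive comparison: lowercase both sides, then compare.
/// (Hypothetical helper, for illustration only.)
fn lowercase_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}

fn main() {
    // Plain ASCII works as expected.
    assert!(lowercase_eq("Hello", "HELLO"));

    // 'ß' is already lowercase, so it never becomes "ss";
    // full case folding would map both strings to "strasse".
    assert!(!lowercase_eq("Straße", "STRASSE"));

    // The two lowercase forms of sigma stay distinct as well;
    // case folding maps both of them to 'σ'.
    assert!(!lowercase_eq("ς", "σ"));
}
```

Uppercasing happens to make these particular pairs compare equal, but it is not defined for matching either, which is why a dedicated case folding operation is what I'm after.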
From a quick search of crates.io, I can find libraries for normalization and word splitting but not case folding. regex-syntax does have case folding code, but it's not exposed in its API.
For my use case, I've found the caseless crate to be most useful.
As far as I know, this is the only case folding library that also supports normalization. This is important when you want e.g. "㎒" (U+3392 SQUARE MHZ) and "mhz" to match. See Chapter 3 - Default Caseless Matching in the Unicode Standard for details on how this works.
Here's some example code that matches a string case-insensitively:
extern crate caseless;

fn main() {
    let a = "100 ㎒";
    let b = "100 mhz";
    // These strings don't match with just case folding,
    // but do match after compatibility (NFKD) normalization
    assert!(!caseless::default_caseless_match_str(a, b));
    assert!(caseless::compatibility_caseless_match_str(a, b));
}
To get the case folded string directly, you can use the default_case_fold_str function:
let s = "Twilight Sparkle ちゃん";
assert_eq!(caseless::default_case_fold_str(s), "twilight sparkle ちゃん");
Caseless doesn't expose a corresponding function that normalizes as well, but you can write one using the unicode-normalization crate:
extern crate caseless;
extern crate unicode_normalization;

use caseless::Caseless;
use unicode_normalization::UnicodeNormalization;

// NFD, fold, NFKD, fold, NFKD
fn compatibility_case_fold(s: &str) -> String {
    s.nfd().default_case_fold().nfkd().default_case_fold().nfkd().collect()
}

fn main() {
    let a = "100 ㎒";
    assert_eq!(compatibility_case_fold(a), "100 mhz");
}
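Note that multiple rounds of normalization and case folding are needed for a correct result: the Unicode Standard defines compatibility caseless matching in terms of NFKD(toCasefold(NFKD(toCasefold(NFD(X))))).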
(Thanks to BurntSushi5 for pointing me to this library.)