
How do I count unique grapheme clusters in a string in Rust?

Tags:

unicode

rust

For example, for

let n = count_unique_grapheme_clusters("🇧🇷 🇷🇺 🇧🇷 🇺🇸 🇧🇷");
println!("{}", n);

the expected output is (the space and three flags: " ", "🇧🇷", "🇷🇺", "🇺🇸"):

4

ozkriff asked Aug 13 '18 08:08


1 Answer

We can use the graphemes method from the unicode-segmentation crate to iterate over the grapheme clusters, collect them into a HashSet<&str> to discard duplicates, and then take the .len() of the set. Passing true to graphemes selects extended grapheme clusters rather than legacy ones.

extern crate unicode_segmentation; // 1.2.1

use std::collections::HashSet;

use unicode_segmentation::UnicodeSegmentation;

fn count_unique_grapheme_clusters(s: &str) -> usize {
    let is_extended = true;
    s.graphemes(is_extended).collect::<HashSet<_>>().len()
}

fn main() {
    assert_eq!(count_unique_grapheme_clusters(""), 0);
    assert_eq!(count_unique_grapheme_clusters("a"), 1);
    assert_eq!(count_unique_grapheme_clusters("🇺🇸"), 1);
    assert_eq!(count_unique_grapheme_clusters("🇷🇺é"), 2);
    assert_eq!(count_unique_grapheme_clusters("🇧🇷🇷🇺🇧🇷🇺🇸🇧🇷"), 3);
}
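
For contrast, here is a minimal std-only sketch of why iterating over chars is not enough for this task. The helper name count_unique_chars is hypothetical and only for illustration. Each flag is made of two regional indicator code points, so counting unique chars on the string from the question counts the individual indicators (🇧, 🇷, 🇺, 🇸) plus the space, not the three flags:

```rust
use std::collections::HashSet;

// Hypothetical helper for comparison only: counts unique Unicode scalar
// values (`char`s), NOT grapheme clusters.
fn count_unique_chars(s: &str) -> usize {
    s.chars().collect::<HashSet<char>>().len()
}

fn main() {
    // Same input as in the question: three distinct flags and spaces.
    let s = "🇧🇷 🇷🇺 🇧🇷 🇺🇸 🇧🇷";
    // Prints 5 (four regional indicators plus the space), not the expected 4.
    println!("{}", count_unique_chars(s));
}
```

This mismatch is the reason the answer segments the string into grapheme clusters before deduplicating.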


ozkriff answered Sep 24 '22 21:09