
How to shuffle a str in place

Tags: rust

I want to shuffle a String in place in Rust, but I seem to be missing something. The fix is probably trivial...

use std::rand::{Rng, thread_rng};

fn main() {
    // I want to shuffle this string...
    let mut value: String = "SomeValue".to_string();
    let mut bytes = value.as_bytes();
    let mut slice: &mut [u8] = bytes.as_mut_slice();

    thread_rng().shuffle(slice);

    println!("{}", value); 
}

The error I get is

<anon>:8:36: 8:41 error: cannot borrow immutable dereference of `&`-pointer `*bytes` as mutable
<anon>:8         let mut slice: &mut [u8] = bytes.as_mut_slice();
                                            ^~~~~

I read about String::as_mut_vec() but it's unsafe so I'd rather not use it.

Melle asked Dec 05 '22 23:12


1 Answer

There's no very good way to do this, partly due to the nature of the UTF-8 encoding of strings, and partly due to the inherent properties of Unicode and text.

There are at least three layers of things that could be shuffled in a UTF-8 string:

  • the raw bytes
  • the encoded codepoints
  • the graphemes

Shuffling raw bytes is likely to give an invalid UTF-8 string as output unless the string is entirely ASCII. Non-ASCII characters are encoded as special sequences of multiple bytes, and shuffling these will almost certainly not leave them in the right order at the end. Hence shuffling bytes is usually a bad idea for non-ASCII text.
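
To make the byte-level problem concrete, here is a minimal sketch using only the standard library. It reverses the bytes instead of shuffling them, purely so the result is deterministic; reverse_bytes is a name made up for this illustration.

```rust
/// Reverses the bytes of `s` (a deterministic stand-in for a shuffle) and
/// tries to reinterpret the result as UTF-8.
fn reverse_bytes(s: &str) -> Result<String, std::string::FromUtf8Error> {
    let mut bytes = s.as_bytes().to_vec();
    bytes.reverse();
    String::from_utf8(bytes)
}

fn main() {
    // Pure ASCII: every byte is a complete character, so any order is valid.
    println!("{:?}", reverse_bytes("abc"));

    // U+00E4 (a with diaeresis) is encoded as the two bytes 0xC3 0xA4.
    // Swapped, the sequence starts with a continuation byte and is rejected.
    println!("{:?}", reverse_bytes("\u{e4}").is_err());
}
```

The second call fails validation, which is exactly what a random byte shuffle risks for any non-ASCII input.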

Shuffling codepoints (char in Rust) makes a little more sense, but there is still the concept of "special sequences": so-called combining characters can be layered onto a single letter to add diacritics etc. (e.g. ä can be written as the letter a followed by U+0308, the codepoint representing the diaeresis). Hence shuffling chars won't give an invalid UTF-8 string, but it may break up these codepoint sequences and give nonsense output.
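
A small sketch of how a combining sequence gets broken up, again with a reversal standing in for a shuffle so the outcome is predictable (reverse_chars is a hypothetical helper, not something from the standard library):

```rust
/// Reverses the codepoints of `s` (a deterministic stand-in for a shuffle).
fn reverse_chars(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // 'a' followed by U+0308 (combining diaeresis), then 'b'.
    let s = "a\u{308}b";
    let r = reverse_chars(s);

    // The result is still valid UTF-8, but the diaeresis now follows 'b'
    // instead of 'a', so it attaches to a different letter.
    println!("{} -> {}", s, r);
    assert_eq!(r, "b\u{308}a");
}
```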

This brings me to graphemes: the sequences of codepoints that make up a single visible character (e.g. ä is a single grapheme whether it is written as one codepoint or as two). Shuffling graphemes will give the most reliably sensible answer.

Once you've decided which of these you want to shuffle, you can pick a shuffling strategy:

  • if the string is guaranteed to be purely ASCII, shuffling the bytes with .shuffle is sensible (with the ASCII assumption, this is equivalent to the others)
  • otherwise, there's no standard way to operate in place: one would get the elements as an iterator (.chars() for codepoints or .graphemes(true) for graphemes), place them into a vector with .collect::<Vec<_>>(), shuffle the vector, and then collect everything back into a new String with e.g. .iter().map(|x| *x).collect::<String>().
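
Both strategies in the list above can be sketched roughly as follows. This is a sketch under assumptions rather than the canonical way to do it: XorShift64, shuffle, shuffle_ascii and shuffle_chars are names made up for illustration, and the hand-rolled generator is only there to keep the example free of dependencies (in real code you would more likely reach for the rand crate). The ASCII version avoids unsafe by taking the String's buffer with into_bytes() and handing it back through String::from_utf8.

```rust
/// Tiny xorshift64 generator, used only so the sketch has no dependencies.
/// Hypothetical helper: not suitable for anything security-related.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle over any slice.
fn shuffle<T>(items: &mut [T], rng: &mut XorShift64) {
    for i in (1..items.len()).rev() {
        // Pick j in 0..=i; the modulo adds a slight bias, fine for a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// ASCII-only strings: shuffle the bytes, reusing the String's own buffer.
/// Returns false and leaves `s` untouched if it contains non-ASCII data.
fn shuffle_ascii(s: &mut String, rng: &mut XorShift64) -> bool {
    if !s.is_ascii() {
        return false;
    }
    let mut bytes = std::mem::take(s).into_bytes();
    shuffle(&mut bytes, rng);
    // Any permutation of ASCII bytes is valid UTF-8, so this cannot fail.
    *s = String::from_utf8(bytes).expect("ASCII bytes are valid UTF-8");
    true
}

/// General strings: decode codepoints into a Vec, shuffle, re-encode.
fn shuffle_chars(s: &str, rng: &mut XorShift64) -> String {
    let mut chars: Vec<char> = s.chars().collect();
    shuffle(&mut chars, rng);
    chars.into_iter().collect()
}

fn main() {
    let mut rng = XorShift64(0x2545_F491_4F6C_DD1D); // any non-zero seed
    let mut value = String::from("SomeValue");
    shuffle_ascii(&mut value, &mut rng);
    println!("{}", value);
    println!("{}", shuffle_chars("Gr\u{fc}\u{df}e", &mut rng));
}
```

Shuffling graphemes would follow the same shape as shuffle_chars, with a Vec<&str> in place of the Vec<char>.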

Codepoints and graphemes are hard to handle in place because UTF-8 does not encode them at a fixed width, so there's no way to take a random codepoint/grapheme out and insert it somewhere else, or otherwise swap two elements efficiently, without just decoding everything into an external Vec.
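
The variable width is easy to see with char::len_utf8. A minimal sketch (utf8_widths is a hypothetical helper name):

```rust
/// Returns the encoded width in bytes of each codepoint in `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}

fn main() {
    // 'a', U+00E4, U+20AC (euro sign) and U+1F980 (crab emoji).
    let s = "a\u{e4}\u{20ac}\u{1f980}";
    for (c, width) in s.chars().zip(utf8_widths(s)) {
        println!("{:?} takes {} byte(s)", c, width);
    }
}
```

Swapping the first and last codepoint of that string in place would mean trading a 1-byte slot for a 4-byte one, which forces everything in between to move.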

Not being in-place is unfortunate, but strings are hard.

(If your strings are guaranteed to be ASCII, then using a type like the Ascii provided by the ascii crate would be a good way to keep things straight at the type level.)
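
To illustrate what "at the type level" could mean, here is a standard-library-only sketch of such a wrapper. AsciiText and its methods are hypothetical and are not the ascii crate's API; the point is only that the ASCII check happens once, at construction, and byte-level operations are then always safe.

```rust
/// Hypothetical newtype (not the ascii crate's API): a String that has been
/// checked to contain only ASCII.
struct AsciiText(String);

impl AsciiText {
    /// Accepts the string only if every byte is ASCII.
    fn new(s: String) -> Option<AsciiText> {
        if s.is_ascii() {
            Some(AsciiText(s))
        } else {
            None
        }
    }

    fn as_str(&self) -> &str {
        &self.0
    }

    /// Swaps two bytes, reusing the existing buffer. Panics if an index is
    /// out of range.
    fn swap_bytes(&mut self, i: usize, j: usize) {
        let mut bytes = std::mem::take(&mut self.0).into_bytes();
        bytes.swap(i, j);
        // Reordering ASCII bytes always yields valid UTF-8.
        self.0 = String::from_utf8(bytes).expect("ASCII bytes are valid UTF-8");
    }
}

fn main() {
    let mut text = AsciiText::new(String::from("SomeValue")).expect("input is ASCII");
    text.swap_bytes(0, 8);
    println!("{}", text.as_str()); // prints "eomeValuS"
}
```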


As an example of the difference between the three, take a look at:

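// In current Rust, .graphemes() comes from the unicode-segmentation crate.
use unicode_segmentation::UnicodeSegmentation;

fn main() {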
    let s = "U͍̤͕̜̲̼̜n̹͉̭͜ͅi̷̪c̠͍̖̻o̸̯̖de̮̻͍̤";
    println!("bytes: {}", s.bytes().count());
    println!("chars: {}", s.chars().count());
    println!("graphemes: {}", s.graphemes(true).count());
}

It prints:

bytes: 57
chars: 32
graphemes: 7

(Generate your own; it demonstrates putting multiple combining characters onto a single letter.)

huon answered Jan 18 '23 21:01