
Slice a string containing Unicode chars

I have a piece of text whose characters have different byte lengths.

let text = "Hello привет"; 

I need to take a slice of the string given start (inclusive) and end (exclusive) character indices. I tried this:

let slice = &text[start..end]; 

and got the following error

thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`' 

I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:

slice = text[start:end] ?

I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

Sasha Tsukanov asked Aug 23 '18 09:08



2 Answers

Possible solutions to codepoint slicing

I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

If you know the exact byte indices, you can slice a string:

let text = "Hello привет";
println!("{}", &text[2..10]);

This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):

let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
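The chars() plus char::len_utf8() alternative mentioned above could look roughly like the following. This is only a sketch: byte_offset is a hypothetical helper name, not something from the standard library, and it assumes that an index past the end of the string should simply map to the string's length.

```rust
/// Hypothetical helper: byte offset at which the `n`-th char starts.
/// If `n` is past the end, this yields `s.len()`.
fn byte_offset(s: &str, n: usize) -> usize {
    // Sum the UTF-8 lengths of the first `n` chars.
    s.chars().take(n).map(char::len_utf8).sum()
}

fn main() {
    let text = "Hello привет";
    // Convert both character indices to byte indices before slicing.
    let (start, end) = (byte_offset(text, 2), byte_offset(text, 8));
    println!("{}", &text[start..end]); // prints "llo пр"
}
```

Each call walks the string from the beginning, so the O(n) cost stays visible.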

As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.

let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());

Why is this not easier?

As you can see, neither of these solutions is all that great. This is intentional, for two reasons:

As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. People usually expect the [] operator to be O(1). Rust makes this runtime cost explicit instead of hiding it. In both solutions above you can clearly see that the lookup is not O(1).

But the more important reason:

Unicode codepoints are generally not a useful unit

What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices by Unicode codepoints, which is what a Rust char represents. A char is 32 bits wide (21 bits would suffice, but the size is rounded up to a power of two).
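To make the size difference concrete, here is a small check of how much space a char takes compared with its UTF-8 encoding inside a str. It only uses the sample text from the question.

```rust
use std::mem::size_of;

fn main() {
    // A `char` always occupies 4 bytes in memory...
    assert_eq!(size_of::<char>(), 4);
    // ...but the same character takes 1 to 4 bytes once encoded as UTF-8.
    assert_eq!('H'.len_utf8(), 1);
    assert_eq!('п'.len_utf8(), 2);

    // The sample text has 12 chars stored in 18 bytes of UTF-8;
    // collected into a Vec<char> it would need 48 bytes.
    let text = "Hello привет";
    assert_eq!(text.chars().count(), 12);
    assert_eq!(text.len(), 18);
    assert_eq!(text.chars().count() * size_of::<char>(), 48);
}
```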

But what you actually want to do is slice user-perceived characters, and that is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more Unicode codepoints. Consider this Python 3 code:

>>> s = "Jürgen"
>>> s[0:2]
'Ju'

Surprising, right? This is because the string above is:

  • 0x004A LATIN CAPITAL LETTER J
  • 0x0075 LATIN SMALL LETTER U
  • 0x0308 COMBINING DIAERESIS
  • ...

This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
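The same effect can be reproduced in Rust with chars(). The sketch below assumes the ü is stored in decomposed form (u followed by U+0308), which is written out with an escape so that it is unambiguous. code_points is a hypothetical helper used only for display.

```rust
/// Hypothetical helper: formats each `char` of `s` as a `U+XXXX` code point.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // Assumption: decomposed ü, i.e. 'u' followed by COMBINING DIAERESIS.
    let s = "Ju\u{0308}rgen";
    println!("{:?}", &code_points(s)[..3]); // prints ["U+004A", "U+0075", "U+0308"]

    // Taking two chars cuts the diaeresis off, just like Python's s[0:2].
    let first_two: String = s.chars().take(2).collect();
    println!("{}", first_two); // prints "Ju"
}
```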

Another example:

>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'

Also not what you'd expect. This time, fi is actually the ligature ﬁ, which is one codepoint.

There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.

So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
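For real text, the unicode-segmentation crate is the tool to reach for. Purely to illustrate the idea without a dependency, the sketch below groups each base char with any following chars from the Combining Diacritical Marks block (U+0300 to U+036F). This is a deliberate simplification and not an implementation of the Unicode segmentation rules: it will mis-handle emoji sequences, Hangul jamo, and many other cases. Both function names are hypothetical.

```rust
/// Simplifying assumption: only the Combining Diacritical Marks block
/// (U+0300..=U+036F) counts as "attaches to the previous char".
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Hypothetical helper: slices by "base char plus trailing combining marks".
/// Out-of-range indices are clamped to the end of the string.
fn slice_clusters(s: &str, start: usize, end: usize) -> &str {
    // Byte offsets where a new cluster begins, plus the end of the string.
    let starts: Vec<usize> = s
        .char_indices()
        .filter(|&(i, c)| i == 0 || !is_combining_mark(c))
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .collect();
    let last = starts.len() - 1;
    let lo = start.min(last);
    let hi = end.min(last).max(lo);
    &s[starts[lo]..starts[hi]]
}

fn main() {
    // Decomposed ü: the diaeresis now stays attached to the 'u'.
    let s = "Ju\u{0308}rgen";
    assert_eq!(slice_clusters(s, 0, 2), "Ju\u{0308}");
    println!("{}", slice_clusters("Hello привет", 6, 8)); // prints "пр"
}
```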


Further resources on this topic:

  • Blogpost "Let's stop ascribing meaning to unicode codepoints"
  • Blogpost "Breaking our Latin-1 assumptions"
  • http://utf8everywhere.org/
Lukas Kalbertodt answered Sep 19 '22 13:09


A UTF-8 encoded string may contain characters which consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.
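The boundary rule can be checked without panicking. The sketch below uses str::is_char_boundary and str::get from the standard library; try_byte_slice is a hypothetical wrapper name added only for illustration.

```rust
/// Hypothetical wrapper: like `&text[start..end]`, but returns `None`
/// instead of panicking when an index is not on a char boundary.
fn try_byte_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}

fn main() {
    let text = "Hello привет";
    // Byte 6 starts 'п', byte 7 is in the middle of it.
    assert!(text.is_char_boundary(6));
    assert!(!text.is_char_boundary(7));

    assert_eq!(try_byte_slice(text, 6, 8), Some("п"));
    assert_eq!(try_byte_slice(text, 6, 7), None);
}
```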

You may use str::char_indices() to solve this (remember that getting to a position in a UTF-8 string is O(n)):

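fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    // Byte offsets of every char boundary, including the end of the string.
    let mut boundaries = string
        .char_indices()
        .map(|(pos, _)| pos)
        .chain(std::iter::once(string.len()));
    // Byte offset of the char at index `start`.
    let start_pos = boundaries.nth(start)?;
    // Byte offset of the char at index `end`; the iterator already sits past `start`.
    let end_pos = if end == start {
        start_pos
    } else {
        boundaries.nth(end - start - 1)?
    };
    Some(&string[start_pos..end_pos])
}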

You may use str::chars() if you are fine with getting a String:

let string: String = text.chars().take(end).skip(start).collect(); 
Tim Diekmann answered Sep 18 '22 13:09