
Comparing a character in a Rust string using indexing

I want to read lines from "input.txt" and keep only those that do not start with the # (comment) symbol. I wrote this code:

use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let file = BufReader::new(File::open("input.txt").unwrap());
    let lines: Vec<String> = file.lines().map(|x| x.unwrap()).collect();
    let mut iter = lines.iter().filter(|&x| x.chars().next() != "#".chars().next());
    println!("{}", iter.next().unwrap());
}

But this line

|&x| x.chars().next() != "#".chars().next()

smells bad to me, because I would expect to be able to write something like |x| x[0] == "#", and because this way I can't easily check the second character in the string.

So how can I refactor this code?

Pavlo Razumovskyi asked Oct 13 '14 18:10



1 Answer

Rust strings are stored as a sequence of bytes representing characters in UTF-8 encoding. UTF-8 is a variable-width encoding, so byte indexing can leave you inside a character, which is obviously unsafe. Indexing by code point avoids that, but getting a code point by index is an O(n) operation. Moreover, indexing code points is not what you really want to do, because there are code points which do not form a character on their own, like combining diacritics or other modifiers. Indexing grapheme clusters is closer to the correct approach, but it is usually needed only in text rendering or, probably, language processing.

What I mean is that indexing a string is hard to define properly, and what most people expect from it is usually wrong. Hence Rust does not provide a generic index operation on strings.
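To make the difference between bytes and code points concrete, here is a small sketch. The sample strings and the helper name byte_and_char_len are made up for illustration; only standard library methods are used.

```rust
/// Hypothetical helper: returns (length in bytes, number of code points).
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8, '#' takes one.
    let s = "\u{e9}#";
    let (bytes, chars) = byte_and_char_len(s);
    println!("{} bytes, {} code points", bytes, chars);

    // Byte offset 1 falls inside the first character, so slicing there would panic.
    println!("offset 1 is a char boundary: {}", s.is_char_boundary(1));

    // Reaching the n-th code point means walking the string from the start.
    println!("second code point: {:?}", s.chars().nth(1));

    // 'e' followed by a combining acute accent: two code points that a reader
    // perceives as a single character.
    let (bytes, chars) = byte_and_char_len("e\u{301}");
    println!("{} bytes, {} code points", bytes, chars);
}
```

Note that len() counts bytes, not characters, which is one more reason a plain integer index is ambiguous.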

Occasionally, however, you do need to index strings. For example, if you know in advance that your string contains only ASCII characters or if you are working with binary data. In this case Rust, of course, provides all necessary means.

First, you can always obtain a view of the underlying sequence of bytes. &str has an as_bytes() method which returns &[u8], a slice of the bytes the string consists of. Then you can use the usual indexing operation:

x.as_bytes()[0] != b'#'

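Note the special notation: b'#' means "ASCII character # of type u8", i.e. it is a byte literal (also note that you don't need to write "#".chars().next() to get the character #; since chars().next() returns an Option<char>, you can compare against Some('#'), which uses a plain character literal). Byte indexing is fragile, however, because &str is a UTF-8-encoded string and a character can consist of more than one byte, so an arbitrary byte index can land in the middle of one. This particular comparison happens to be reliable, because in UTF-8 a byte below 0x80 is always a complete ASCII character, but [0] will still panic on an empty string.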
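For the original task you may not need indexing at all. As a minimal sketch, assuming the lines are already collected into a Vec<String> as in the question, a character literal can be combined with str::starts_with (a different technique from byte indexing), and chars().nth(1) gives the second character. The function names here are hypothetical.

```rust
/// Keeps only the lines that do not start with '#'.
fn non_comment_lines<'a>(lines: &'a [String]) -> Vec<&'a str> {
    lines
        .iter()
        .map(|line| line.as_str())
        // starts_with accepts a char pattern and is simply false for an empty line
        .filter(|line| !line.starts_with('#'))
        .collect()
}

/// Returns the second character of a line, or None if the line is too short.
fn second_char(line: &str) -> Option<char> {
    line.chars().nth(1)
}

fn main() {
    let lines = vec![
        "# a comment".to_string(),
        "127.0.0.1 localhost".to_string(),
    ];
    for line in non_comment_lines(&lines) {
        println!("{}", line);
    }
    println!("{:?}", second_char("#!"));
}
```

Both helpers work on any UTF-8 input and neither can panic on an empty line.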

The proper way to handle ASCII data in Rust is to use the ascii crate. You can go from &str to &AsciiStr with the as_ascii_str() method. Then you can use it like this:

extern crate ascii;
use ascii::{AsAsciiStr, AsciiChar};

// ...

x.as_ascii_str().unwrap()[0] != AsciiChar::Hash

This way you will need slightly more typing but you will get much more safety in return, because as_ascii_str() checks that you work with ASCII data only.
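If pulling in a crate is not an option, a similar "check first, then index" pattern can be sketched with the standard library alone. This is only an approximation of what the ascii crate gives you, and the helper name ascii_byte_at is made up for this example.

```rust
/// Hypothetical helper: returns the byte at `index` only if the whole
/// string is ASCII and the index is in bounds; otherwise returns None.
fn ascii_byte_at(s: &str, index: usize) -> Option<u8> {
    if s.is_ascii() {
        // get() returns None instead of panicking on an out-of-range index
        s.as_bytes().get(index).copied()
    } else {
        None
    }
}

fn main() {
    println!("{:?}", ascii_byte_at("#comment", 0) == Some(b'#'));
    println!("{:?}", ascii_byte_at("", 0));
}
```

Unlike the ascii crate, this does not encode the "ASCII only" guarantee in the type, so the check is repeated on every call.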

Sometimes, however, you just want to work with binary data, without really interpreting it as characters, even if the source contains some ASCII characters. This can happen, for example, when you're writing a parser for some markup language like Markdown. In this case you can treat the whole input as a sequence of bytes:

use std::io::{Read, BufReader};
use std::fs::File;

fn main() {
    let mut file = BufReader::new(File::open("/etc/hosts").unwrap());
    let mut buf = Vec::new();
    file.read_to_end(&mut buf).unwrap();
    // first() returns None for an empty line, where line[0] would panic
    let mut iter = buf.split(|&c| c == b'\n').filter(|line| line.first() != Some(&b'#'));
    println!("{:?}", iter.next().unwrap());
}
Vladimir Matveev answered Sep 21 '22 19:09