
How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

Every programming language has its own interpretation of \n and \r. Unicode supports multiple characters that can represent a new line.

From the Rust reference:

A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.

Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.

What about the following?

  • Next line character (U+0085)
  • Line separator character (U+2028)
  • Paragraph separator character (U+2029)

In my opinion, we are missing something like a char.is_new_line(). I looked through the Unicode Character Categories but couldn't find a definition for new-lines.

Do I have to come up with my own definition of what a Unicode new-line character is?

Noel Widmer asked Jul 09 '17 11:07



1 Answer

Languages like Java, Python, Go and JavaScript disagree considerably in practice about what constitutes a newline character and how that translates into "new lines". The disagreement shows in how their batteries-included regex engines match a pattern like $ against a string like \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single line ending for $, and neither does Rust's regex crate; Java's does, however. I don't know of any language whose batteries extend newline handling to any more Unicode characters.

So the takeaway here is:

  • It is agreed upon that \n is a newline
  • \r\n may be a single newline
  • unless \r\n is treated as two newlines
  • unless \r\n is "some character followed by a newline"
  • You shall not have any more newlines besides that.
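
To see where Rust's standard library lands on that list, here is a minimal sketch built on str::lines. The helper name std_lines is hypothetical and exists only for illustration; the behaviour shown is that of str::lines as I understand it: a line ends at \n or \r\n, and a lone \r is left inside the line.

```rust
/// Hypothetical helper (not part of std): collects the lines that
/// `str::lines` yields so they are easy to inspect.
fn std_lines(text: &str) -> Vec<&str> {
    // `str::lines` ends a line at "\n" or "\r\n".
    // A '\r' that is not followed by '\n' stays part of the line,
    // and characters such as U+0085, U+2028 and U+2029 are not line endings.
    text.lines().collect()
}
```

With this, std_lines("\r\r\n\n") gives two lines, "\r" and "", and std_lines("a\u{2028}b") gives a single line, so the standard library sits in the "\n is a newline, \r\n is a single newline, nothing else" camp.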

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on them, though. After all, ASCII has had a record separator character for a gazillion years, and everybody uses \t instead anyway.
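
Such a function could look like the sketch below. The name is_unicode_newline is made up, and the set of characters is an assumption on my part: it covers \n, \r and the three characters from the question (U+0085, U+2028, U+2029). Depending on your definition you may also want vertical tab (U+000B) and form feed (U+000C).

```rust
/// Hypothetical helper (not part of std): returns true for the characters
/// this sketch chooses to treat as newlines.
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\n'           // U+000A line feed
        | '\r'         // U+000D carriage return
        | '\u{0085}'   // next line (NEL)
        | '\u{2028}'   // line separator
        | '\u{2029}'   // paragraph separator
    )
}
```

It can be passed straight to str::split, e.g. "a\u{2028}b".split(is_unicode_newline). Note that this works per character, so \r\n counts as two newlines and yields an empty string in between.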

Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is that by the time you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
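
If all you need is the hard-break part of that report, a rough sketch could look like this. It only handles the rule that \r\n is never split and that a break follows \r, \n and U+0085, and it adds U+2028 and U+2029 as further hard breaks. The function name split_hard_breaks is hypothetical, and this is nowhere near a full implementation of the line breaking algorithm.

```rust
/// Hypothetical helper (not part of std): splits `text` at hard line breaks,
/// treating "\r\n" as one break. Everything else in UAX #14 is ignored.
fn split_hard_breaks(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0; // byte offset where the current line begins
    let mut chars = text.char_indices().peekable();

    while let Some((idx, c)) = chars.next() {
        // Length in bytes of the break sequence starting at `idx`.
        let break_len = match c {
            '\r' => {
                // CR directly followed by LF is a single break.
                if matches!(chars.peek(), Some(&(_, '\n'))) {
                    chars.next();
                    2
                } else {
                    1
                }
            }
            '\n' | '\u{0085}' | '\u{2028}' | '\u{2029}' => c.len_utf8(),
            _ => continue, // ordinary character, keep scanning
        };
        lines.push(&text[start..idx]);
        start = idx + break_len;
    }

    // Text after the last break, if any, forms a final line.
    if start < text.len() {
        lines.push(&text[start..]);
    }
    lines
}
```

For the string \r\r\n\n this yields three (empty) lines, which matches the "three lines" reading above: \r, then \r\n, then \n.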

user2722968 answered Oct 21 '22 20:10