Every programming language has its own interpretation of \n and \r.
Unicode supports multiple characters that can represent a new line.
From the Rust reference:
A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively.
Based on that statement, I'd say a Rust character is a new-line character if it is either \n or \r. On Windows it might be the combination of \r and \n. I'm not sure though.
What about the following?
In my opinion, we are missing something like a char.is_new_line().
I looked through the Unicode Character Categories but couldn't find a definition for new-lines.
Do I have to come up with my own definition of what a Unicode new-line character is?
Compare the length of the string in characters with its size in bytes. If both are equal, then it is ASCII. If the size in bytes is larger than the number of characters, then it contains non-ASCII Unicode characters.
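A minimal sketch of that check in Rust, assuming the string is a UTF-8 `&str`. The helper name `is_ascii_by_length` is made up for illustration; the standard library's `str::is_ascii` answers the same question directly.

```rust
/// Hypothetical helper: true when every character occupies exactly one byte
/// in UTF-8, which only holds for ASCII characters.
fn is_ascii_by_length(s: &str) -> bool {
    // `len()` is the size in bytes; `chars().count()` is the number of chars.
    s.len() == s.chars().count()
}

fn main() {
    for s in ["line one\n", "naïve\n"] {
        println!(
            "{:?}: by length = {}, is_ascii = {}",
            s,
            is_ascii_by_length(s),
            s.is_ascii()
        );
    }
}
```

Note that this only tells you whether non-ASCII characters are present; it says nothing about which of them are newlines.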
The ASCII character code 10 is sometimes written as \n and is sometimes called a New Line or NL. ASCII character 10 is also called a Line Feed or LF. On a UNIX-based operating system such as Linux or Mac, it is all you typically use to delimit a line in a file.
Simply comparing to '\n' should solve your problem; depending on what you consider to be a newline character, you might also want to check for '\r' (carriage return).
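As a rough sketch, that comparison could look like the following. The function name `is_ascii_newline` is hypothetical, not something the standard library provides.

```rust
/// Hypothetical predicate: treats LF and CR as newline characters.
fn is_ascii_newline(c: char) -> bool {
    c == '\n' || c == '\r'
}

fn main() {
    let text = "first\r\nsecond\nthird";
    // Counts individual characters, so the "\r\n" pair contributes two.
    let count = text.chars().filter(|&c| is_ascii_newline(c)).count();
    println!("{} newline characters", count);
}
```

Because this works per character, a Windows-style \r\n is counted twice; whether that is what you want depends on your definition of a newline.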
The Rust programming language supports Unicode in its core through the char primitive type, which represents a Unicode scalar value and is therefore agnostic of the Unicode version.
There is considerable practical disagreement between languages like Java, Python, Go and JavaScript as to what constitutes a newline-character and how that translates to "new lines". The disagreement is demonstrated by how the batteries-included regex engines treat patterns like $ against a string like \r\r\n\n in multi-line-mode: Are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, like Unicode says) or four (\r, \r, \n, \n, like JS sees it)? Go and Python do not treat \r\n as a single $ and neither does Rust's regex crate; Java's does however. I don't know of any language whose batteries extend newline-handling to any more Unicode characters.
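For comparison, here is how the Rust standard library's `str::lines` (not the regex crate) treats the same string. It splits on \n and strips a single \r directly before it; a bare \r is not a line terminator. The wrapper `std_lines` is only there for illustration.

```rust
/// Illustrative wrapper around the standard library's `str::lines`.
fn std_lines(input: &str) -> Vec<&str> {
    input.lines().collect()
}

fn main() {
    // "\r\r\n" becomes a line containing one bare "\r"; the final "\n"
    // terminates an empty line.
    println!("{:?}", std_lines("\r\r\n\n")); // prints ["\r", ""]
}
```

So `lines` sides with the "two lines" reading, with a stray \r left in the first line.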
So the takeaway here is:
\n is a newline
\r\n may be a single newline
or \r\n is treated as two newlines
or \r\n is "some character followed by a newline"
If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to rely on that. After all, we have had the ASCII Record Separator for a gazillion years and everybody uses \t instead as well.
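If you do go down that road, such a function might look like the sketch below. The name `is_unicode_newline` is hypothetical, and the character set is an assumption: it takes the characters that UAX #14 assigns to the mandatory break classes (BK, CR, LF, NL). Your own definition may differ.

```rust
/// Hypothetical predicate covering the characters that UAX #14 classifies
/// as mandatory line breaks (classes BK, CR, LF and NL).
fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LINE FEED (LF)
        | '\u{000B}' // LINE TABULATION (VT)
        | '\u{000C}' // FORM FEED (FF)
        | '\u{000D}' // CARRIAGE RETURN (CR)
        | '\u{0085}' // NEXT LINE (NEL)
        | '\u{2028}' // LINE SEPARATOR
        | '\u{2029}' // PARAGRAPH SEPARATOR
    )
}

fn main() {
    let text = "a\u{2028}b\nc\u{0085}d";
    // Split on every character the predicate accepts.
    let parts: Vec<&str> = text.split(is_unicode_newline).collect();
    println!("{:?}", parts);
}
```

Passing the predicate to `str::split` works per character, so a \r\n pair still yields an empty piece between the two characters.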
Update: See http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules section LB5 for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on how your original question would have to be implemented. My guess is by the point you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)
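To make the CR/LF part of that rule concrete, here is a small sketch that counts hard line breaks while treating \r\n as a single break. The function `count_line_breaks` is made up for this example and deliberately ignores every other break character, so it is nowhere near a full UAX #14 implementation.

```rust
/// Hypothetical counter: CR LF counts as one break, a lone CR or a lone LF
/// counts as one break each. Other Unicode break characters are ignored.
fn count_line_breaks(s: &str) -> usize {
    let mut count = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Consume a directly following LF so the pair is one break.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            '\n' => count += 1,
            _ => {}
        }
    }
    count
}

fn main() {
    println!("{}", count_line_breaks("\r\r\n"));   // prints 2
    println!("{}", count_line_breaks("\r\r\n\n")); // prints 3
}
```

The first \r is not followed by \n, so it is a break on its own; the second \r pairs up with the \n that follows it.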