Every programming language has its own interpretation of <code>\n</code> and <code>\r</code>, and Unicode defines several characters that can represent a new line. From the Rust reference: <blockquote> A whitespace escape is one of the characters U+006E (n), U+0072 (r), or U+0074 (t), denoting the Unicode values U+000A (LF), U+000D (CR) or U+0009 (HT) respectively. </blockquote> Based on that statement, I'd say a Rust character is a new-line character if it is either <code>\n</code> or <code>\r</code>. On Windows it might be the combination of <code>\r</code> followed by <code>\n</code>, but I'm not sure. What about the following characters? <ul> <li>Next line character (U+0085)</li> <li>Line separator character (U+2028)</li> <li>Paragraph separator character (U+2029)</li> </ul> In my opinion, we are missing something like a <code>char.is_new_line()</code>. I looked through the Unicode Character Categories but couldn't find a definition for new-line characters. Do I have to come up with my own definition of what a Unicode new-line character is?

How do I check if a character is a Unicode new-line character (not only ASCII) in Rust?

1 Answer

Languages such as Java, Python, Go and JavaScript disagree in practice about what constitutes a newline character and how that translates into "new lines". The disagreement shows in how their batteries-included regex engines treat a pattern like $ against the string \r\r\n\n in multi-line mode: are there two lines (\r\r\n, \n), three lines (\r, \r\n, \n, as Unicode says) or four (\r, \r, \n, \n, as JS sees it)? Go and Python do not treat \r\n as a single line ending for $, and neither does Rust's regex crate; Java's engine does, however. I don't know of any language whose batteries extend newline handling to any further Unicode characters.
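To make the three readings of \r\r\n\n concrete, here is a minimal sketch that counts line terminators under each of them. The `Policy` enum and `count_terminators` are names I made up for illustration; they do not model any particular regex engine exactly, only the three counting rules described above.

```rust
/// Hypothetical policies for turning CR/LF characters into line terminators.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Policy {
    /// Only `\n` ends a line; any `\r` is treated as ordinary content.
    LfOnly,
    /// `\r\n`, a lone `\r` and a lone `\n` each end exactly one line.
    CrLfAsOne,
    /// Every `\r` and every `\n` ends a line on its own.
    EachChar,
}

/// Counts how many line terminators `s` contains under the given policy.
pub fn count_terminators(s: &str, policy: Policy) -> usize {
    let mut count = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        match (policy, c) {
            // `\n` terminates a line under every policy.
            (_, '\n') => count += 1,
            // Under LfOnly a carriage return never counts by itself.
            (Policy::LfOnly, '\r') => {}
            (Policy::CrLfAsOne, '\r') => {
                // Swallow a directly following `\n` so CRLF counts once.
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                count += 1;
            }
            (Policy::EachChar, '\r') => count += 1,
            // Everything else is ordinary content.
            _ => {}
        }
    }
    count
}
```

Calling `count_terminators("\r\r\n\n", policy)` gives 2 for `LfOnly`, 3 for `CrLfAsOne` and 4 for `EachChar`, which are the two, three and four lines from the paragraph above.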

So the takeaway here is:

It is agreed upon that \n is a newline
\r\n may be a single newline
unless \r\n is treated as two newlines
unless \r\n is "some character followed by a newline"
You shall not have any more newlines besides that.

If you really need more Unicode characters to be treated as newlines, you'll have to define a function that does that for you. Don't expect real-world input to actually use them, though. After all, we have had the ASCII record separator for a gazillion years, and everybody uses \t instead anyway.
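Such a function could look roughly like the sketch below. The character set is an assumption on my part: it takes LF, VT, FF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029), which is how I read the newline guidelines in the Unicode Standard, so adjust it to whatever your input needs. Both `is_unicode_newline` and `split_unicode_lines` are hypothetical helpers, not part of Rust's standard library.

```rust
/// Returns true if `c` is one of the characters this sketch treats as a
/// line terminator. The set is an assumption; trim or extend it as needed.
pub fn is_unicode_newline(c: char) -> bool {
    matches!(
        c,
        '\u{000A}'   // LF, line feed
        | '\u{000B}' // VT, vertical tab
        | '\u{000C}' // FF, form feed
        | '\u{000D}' // CR, carriage return
        | '\u{0085}' // NEL, next line
        | '\u{2028}' // LS, line separator
        | '\u{2029}' // PS, paragraph separator
    )
}

/// Splits `s` into lines, treating `\r\n` as one terminator and every other
/// character accepted by `is_unicode_newline` as a terminator on its own.
/// Terminators are not included in the returned slices.
pub fn split_unicode_lines(s: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut iter = s.char_indices().peekable();
    while let Some((i, c)) = iter.next() {
        if !is_unicode_newline(c) {
            continue;
        }
        lines.push(&s[start..i]);
        let mut end = i + c.len_utf8();
        if c == '\r' {
            // A `\n` directly after `\r` belongs to the same terminator.
            if let Some(&(j, '\n')) = iter.peek() {
                end = j + 1;
                iter.next();
            }
        }
        start = end;
    }
    // Keep a final line that has no terminator.
    if start < s.len() {
        lines.push(&s[start..]);
    }
    lines
}
```

With this, `split_unicode_lines("a\r\nb\u{2028}c\u{0085}d")` yields the four slices a, b, c and d, and \r\r\n\n splits into three empty lines. Note that this only covers hard line terminators; it says nothing about where a line may be wrapped.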

Update: See rule LB5 at http://www.unicode.org/reports/tr14/tr14-32.html#BreakingRules for why \r\r\n should be treated as two line breaks. You could read the whole page to get a grip on what a full implementation of your original question would involve. My guess is that by the time you reach "South East Asian: line breaks require morphological analysis" you'll close the tab :-)

answered Oct 21 '22 20:10

user2722968

Tags:

newline

unicode

carriage-return

rust

linefeed

Noel Widmer
