 

How to display whitespace characters using Unicode for debugging/editing

I want to display whitespace characters while debugging or editing text by replacing them with sensible Unicode code points and colouring them grey instead of black.

For example, I would like to replace

  • SPACE U+0020 with MIDDLE DOT U+00B7 (·)
  • NO-BREAK SPACE U+00A0 with MEDIUM SMALL WHITE CIRCLE U+26AC (⚬)
  • TAB U+0009 with RIGHTWARDS ARROW U+2192 (→)
  • and so on...

I'm looking for sensible glyphs for:

  • CARRIAGE RETURN U+000D
  • newline/LINE FEED U+000A.

I don't want to use the PILCROW SIGN U+00B6 (¶), as it intuitively corresponds to neither character but rather to the concept of a new paragraph. There is also DOWNWARDS ARROW WITH CORNER LEFTWARDS U+21B5 (↵), but again, it seems more like a combination symbol than a representation of either one individually.

When I have mixed line endings I want to be able to see which character is being used (or both). I am displaying the output in HTML in a browser.

Currently I can't think of any better symbols than:

  • LEFTWARDS ARROW U+2190 (←) for carriage return
  • DOWNWARDS ARROW U+2193 (↓) for newline

I am aware of SYMBOL FOR CARRIAGE RETURN U+240D (␍), SYMBOL FOR LINE FEED U+240A (␊) and SYMBOL FOR NEWLINE U+2424 (␤), but the detail on them is hard to see.

I also don't want to use \r and \n, for two reasons: r and n look a little similar (not much, but a little), and each takes two characters to display instead of one. However, if I don't get any better suggestions, I might use DOWNWARDS ARROW WITH CORNER LEFTWARDS U+21B5 (↵) for carriage return and RIGHTWARDS ARROW WITH CORNER DOWNWARDS U+21B4 (↴) for newline.

CJ Dennis asked Mar 21 '15 07:03



1 Answer

As you've said, U+21B5 (↵) is a good choice for carriage return. It is the symbol printed on your Enter key, and it has been used for this purpose since the days of electric typewriters. That is also where the name comes from: the key would literally return the carriage, which holds the paper and moves it under the ink ribbon head. I think the symbol has become ingrained enough in keyboard users to be intuitively recognizable.

Since you've raised concerns about visibility, however, consider U+23CE (⏎). This symbol is part of the Unicode standard for the express purpose of representing a return. It might, however, be read as meaning a new line in general, which is often a combination of a carriage return and a line feed (depending on the system).

U+21B5 (↵) is part of the Unicode Arrows block, while U+23CE (⏎) is part of the Miscellaneous Technical block. The latter seems closer to what a technical use like yours calls for than a regular arrow does.
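If you want to check which block a candidate belongs to, a minimal sketch in Python (chosen purely for illustration) could look like the following. The `BLOCKS` table and the `describe` helper are made up for this example; the ranges are copied by hand for only the blocks mentioned in this answer, so this is not a complete block lookup.

```python
import unicodedata

# Hand-copied ranges for the blocks discussed here (an assumption of this
# sketch, not a full Unicode block table).
BLOCKS = [
    (0x2190, 0x21FF, "Arrows"),
    (0x2300, 0x23FF, "Miscellaneous Technical"),
    (0x2400, 0x243F, "Control Pictures"),
]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME [block]' for a single character."""
    code = ord(ch)
    block = next((name for lo, hi, name in BLOCKS if lo <= code <= hi), "other")
    return f"U+{code:04X} {unicodedata.name(ch, '<unnamed>')} [{block}]"


for candidate in "\u21b5\u23ce\u21e9\u21e6\u240a\u240d":
    print(describe(candidate))
```

Anything outside the three listed ranges is reported as "other", which is a limitation of the hand-written table rather than a statement about the character.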

That leaves us with the line feed. When you start to think about what it actually is, even the choice of the return arrow becomes questionable. A line feed is basically an instruction to move down one line. A carriage return simply moves the caret (the "carriage") back to the start of the line. A line feed doesn't have to be combined with a carriage return, nor does a carriage return have to be combined with a line feed (although it normally makes little sense not to).

On a typewriter this makes sense. After typing a line you would swing the carriage back to the start, then scroll the paper upwards: basically a carriage return plus a line feed. Now you see why "new line" might make sense as a combination of the two for historical reasons, and why they can be used in either order. Technically you can do a line feed without a carriage return and continue typing in the column where you left off on the previous line.

The reason this brings our ↵/⏎ into question is that the symbol seems to imply a carriage return AND a line feed. Indeed, on electric typewriters and word processors it normally results in a full new line.
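Because the two characters are independent, a file with mixed line endings can contain CR LF pairs, lone CRs and lone LFs side by side. As a rough sketch (Python, with the hypothetical helper name `count_line_endings`), you could tally them like this before deciding how to display them. It treats an LF followed by a CR as two separate characters, which is an assumption of this example.

```python
import re
from collections import Counter

# Try CRLF first so a pair is not counted as a lone CR plus a lone LF.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")
_NAMES = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def count_line_endings(text: str) -> Counter:
    """Count CRLF pairs, lone carriage returns and lone line feeds."""
    return Counter(_NAMES[m.group()] for m in _LINE_ENDING.finditer(text))


sample = "windows\r\nunix\nold mac\rend"
print(count_line_endings(sample))
```

A result with more than one key tells you the text has mixed line endings.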

So, how to represent line feed? An arrow pointing down seems like the intuitive choice, but then we might need to rethink our carriage return as well. U+21E9 (downwards white arrow, ⇩) is visually (likely, given that glyphs may vary) the most congruent with ⏎. But if we're going with that, you might as well use U+21E6 (leftwards white arrow, ⇦) for your carriage return.

What to choose with so many options? Personally, I think the technically superior choice is the pair of characters from the Unicode Control Pictures block: U+240A (␊) for line feed and U+240D (␍) for carriage return. They also appeal to the programmer in me, because the low byte of each code point equals the ASCII code of the character it depicts (0x0A and 0x0D). I understand that they can be hard to make out on screen, and usability may be more important. Still, lots of text editors go with some variation of this when asked to show all symbols.
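The correspondence between the code points makes the mapping a one-liner: add the control character's code to U+2400. A small sketch in Python, where `control_picture` is just an illustrative name and only the C0 controls (U+0000 to U+001F) are handled:

```python
def control_picture(ch: str) -> str:
    """Map a C0 control character to its Control Pictures glyph.

    Any other character is returned unchanged.
    """
    code = ord(ch)
    if code < 0x20:
        # U+2400 + code: e.g. LF (0x0A) -> U+240A, CR (0x0D) -> U+240D
        return chr(0x2400 + code)
    return ch


print("".join(control_picture(c) for c in "one\ttwo\r\n"))
```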

So I'd say the options are...

  • ␊ and ␍ for being most technically correct.
  • ⇩ and ⇦ for the most visual clarity, being in the same Unicode block (Arrows) and likely to be consistent in presentation for a given font.
  • ↵ or ⏎ as carriage return for being the most easily recognizable, and then some other option for line feeds; but this is also possibly the most confusing, since the angled arrow really kind of implies carriage return + line feed.
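To make the options above concrete, here is a minimal sketch of how the substitution could be done when generating the HTML. It is written in Python only for illustration; the names `GLYPH_SETS`, `COMMON`, `show_whitespace` and the CSS class `ws` are all made up, and the space, no-break space and tab replacements are the ones proposed in the question. It assumes the output goes inside a `<pre>` element, so a real newline is kept after each line feed glyph to preserve the line break.

```python
import html

# Hypothetical glyph sets for two of the options listed above.
GLYPH_SETS = {
    "control_pictures": {"\r": "\u240d", "\n": "\u240a"},  # ␍ ␊
    "white_arrows": {"\r": "\u21e6", "\n": "\u21e9"},      # ⇦ ⇩
}

# Replacements taken from the question: space, no-break space, tab.
COMMON = {" ": "\u00b7", "\u00a0": "\u26ac", "\t": "\u2192"}


def show_whitespace(text: str, style: str = "white_arrows",
                    css_class: str = "ws") -> str:
    """Return HTML in which whitespace is replaced by grey-able glyph spans."""
    glyphs = {**COMMON, **GLYPH_SETS[style]}
    out = []
    for ch in text:
        if ch in glyphs:
            out.append(f'<span class="{css_class}">{glyphs[ch]}</span>')
            if ch == "\n":
                out.append("\n")  # keep the actual break for <pre> output
        else:
            out.append(html.escape(ch))  # escape <, >, & and quotes
    return "".join(out)


print(show_whitespace("a b\tc\r\nd\n"))
```

A stylesheet rule such as `.ws { color: grey; }` would then give the grey colouring asked for in the question. Because CR and LF are handled one character at a time, a CR LF pair shows both glyphs while a lone CR or LF shows only one.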

Also make sure you pick something that is likely to be properly shown in the majority of browsers, with the varying default fonts on various browsers and systems. I noticed some of the code points for supplemental blocks didn't show up when I went through the UTF-8 table.

Finally, one remark: is it necessary to use Unicode symbols? Notepad++, my favourite text editor, uses big "CR" and "LF" symbols on a gray background when all symbols are visualized. Perhaps you can simply use images (preferably scaled according to the font size in your CSS)?
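A variation on that idea, sketched here in Python as an assumption rather than a description of how Notepad++ does it, is to emit small text labels in styled spans instead of images. The function name `label_line_endings`, the class `ctrl` and the CSS values are all hypothetical. Sizing the label in `em` units makes it scale with the surrounding font.

```python
import html

LABELS = {"\r": "CR", "\n": "LF"}

# Example styling only; adjust the colours and sizes to taste.
CSS = (".ctrl { background: #888; color: #fff; font-size: 0.7em; "
       "padding: 0 0.2em; border-radius: 0.2em; }")


def label_line_endings(text: str) -> str:
    """Replace CR and LF with labelled spans, escaping everything else."""
    parts = []
    for ch in text:
        if ch in LABELS:
            parts.append(f'<span class="ctrl">{LABELS[ch]}</span>')
            if ch == "\n":
                parts.append("\n")  # keep the real break for <pre> output
        else:
            parts.append(html.escape(ch))
    return "".join(parts)


print(label_line_endings("mixed\r\nendings\n"))
```

This costs more horizontal space than a single glyph, which the question wanted to avoid, but the labels are unambiguous.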

G_H answered Nov 17 '22 18:11
