
Semantics of show w.r.t. escape characters

Tags: haskell, ghc

Consider the following examples (λ> = ghci, $ = shell):

λ> writeFile "d" $ show "d"
$ cat d
"d"

λ> writeFile "d" "d"
$ cat d
d

λ> writeFile "backslash" $ show "\\"
$ cat backslash
"\\"

λ> writeFile "backslash" "\\"
$ cat backslash
\

λ> writeFile "cat" $ show "🐈" -- U+1F408
$ cat cat
"\128008"

λ> writeFile "cat" "🐈"
$ cat cat
🐈

I understand that "\128008" is just another way of representing "🐈" in Haskell source code. My question is: why does the "🐈" example behave like the backslash instead of like "d"? Since it is a printable character, shouldn't it behave like a letter?

More generally, what is the rule that determines whether a character will be shown as a printable character or as an escape code? I looked at Section 6.3 of the Haskell 2010 Language Report, but it doesn't specify the exact behaviour.

typesanitizer asked Feb 17 '18 22:02


1 Answer

TL;DR: Printable characters inside the ASCII range (0-127) are shown as themselves.* Everything else is escaped.

* Except for double quotes (as they're used for string delimiters) and backslashes (because they're needed for escaping).
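The rule can be checked with a small experiment. This is only a sketch: shownVerbatim is a hypothetical helper written for this answer, not something exported by base, and it describes how show treats characters inside a String (for a lone Char the quoting differs slightly).

```haskell
import Data.Char (isAscii, isPrint)

-- Hypothetical helper (not part of base): predicts whether 'show' on a
-- String leaves the given character unchanged, following the rule above.
shownVerbatim :: Char -> Bool
shownVerbatim c = isAscii c && isPrint c && c /= '"' && c /= '\\'

main :: IO ()
main = do
  putStrLn (show "d")           -- prints "d"
  putStrLn (show "\\")          -- prints "\\"
  putStrLn (show "say \"hi\"")  -- prints "say \"hi\""
  putStrLn (show "\128008")     -- prints "\128008"
  -- One Bool per character: d, backslash, double quote, U+1F408
  print (map shownVerbatim "d\\\"\128008")
```

Only the first character in the last line is predicted to survive unchanged, which matches the ghci session in the question.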

Let's have a look at the source code to figure this one out!

Since we have String = [Char], we should hunt for instance Show Char in the source. It can be found here. It is defined as:

-- | @since 2.01
instance  Show Char  where
    showsPrec _ '\'' = showString "'\\''"
    showsPrec _ c    = showChar '\'' . showLitChar c . showChar '\''

    showList cs = showChar '"' . showLitString cs . showChar '"'

So showing a String (using showList) is basically a wrapper around showLitString, and showing a Char is a wrapper around showLitChar. Let's look at those functions.

showLitString :: String -> ShowS
-- | Same as 'showLitChar', but for strings
-- It converts the string to a string using Haskell escape conventions
-- for non-printable characters. Does not add double-quotes around the
-- whole thing; the caller should do that.
-- The main difference from showLitChar (apart from the fact that the
-- argument is a string not a list) is that we must escape double-quotes
showLitString []         s = s
showLitString ('"' : cs) s = showString "\\\"" (showLitString cs s)
showLitString (c   : cs) s = showLitChar c (showLitString cs s)
   -- [explanatory comments ...]

As you might've expected, showLitString is mostly a wrapper around showLitChar. [Note: If you're unfamiliar with the ShowS type, this is a good answer to understand why it might be useful.] Not quite what we were looking for, so let us go to showLitChar (I've omitted parts of the definition which aren't relevant to the question).

-- | Convert a character to a string using only printable characters,
-- using Haskell source-language escape conventions.  For example:
-- [...]
showLitChar                :: Char -> ShowS
showLitChar c s | c > '\DEL' =  showChar '\\' (protectEsc isDec (shows (ord c)) s)
-- ^ Pattern matched for cat
showLitChar '\DEL'         s =  showString "\\DEL" s
showLitChar '\\'           s =  showString "\\\\" s
-- ^ Pattern matched for backslash
showLitChar c s | c >= ' '   =  showChar c s
-- ^ Pattern matched for d
-- Some more escape codes
showLitChar '\a'           s =  showString "\\a" s
-- similarly for '\b', '\f', '\n', '\r', '\t', '\v' etc.
-- showLitChar ... = ...

Now you can see what is going on. ord c is an Int, and the first equation matches every character above '\DEL' (ord '\DEL' == 127), that is, every non-ASCII character, so the cat is shown as a backslash followed by its decimal code point. For characters in the ASCII range, the printable ones are shown as-is and the rest are escaped. For characters outside it, all of them are escaped.
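Since showLitChar is exported from Data.Char, each of these equations can be tried directly. The snippet below is a small sketch; the last line also shows what protectEsc isDec appears to be for, namely separating a numeric escape from a digit that follows it.

```haskell
import Data.Char (showLitChar)

main :: IO ()
main = do
  putStrLn (showLitChar 'd' "")        -- prints d        (c >= ' ' equation)
  putStrLn (showLitChar '\\' "")       -- prints \\       (backslash equation)
  putStrLn (showLitChar '\DEL' "")     -- prints \DEL
  putStrLn (showLitChar '\128008' "")  -- prints \128008  (c > '\DEL' equation)
  -- The string below is the cat followed by the digit '1'. Without the
  -- "\&" separator the output would read back as the single escape \1280081.
  putStrLn (show "\128008\&1")         -- prints "\128008\&1"
```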

The code doesn't answer the "why" part of the question. The answer to that (I think) is in the very first comment that we saw:

-- | @since 2.01
instance  Show Char  where

If I were to guess, I'd say this behaviour has been kept around to maintain backwards compatibility. But I don't need to guess: see the comments for some good answers to this.

Bonus

We can do a git blame online using GHC's GitHub mirror ;). Let's see when this code was written (blame link). The relevant commit is 15 years old (!). However, it does mention Unicode.

The functionality to distinguish between different types of Unicode characters is present in the Data.Char module. Looking at the source:

isPrint    c = iswprint (ord c) /= 0

foreign import ccall unsafe "u_iswprint"
  iswprint :: Int -> Int

If you trace the commit which introduced iswprint, you'll end up here. That commit was made 13 years ago. Maybe enough code had been written in those two years that they didn't want to break it? I don't know. If some GHC developer could shed more light on this, that'd be awesome :).

Daniel Wagner and Paul Johnson have pointed out a very good reason for this in the comments: interoperating with non-Unicode systems must've been a high priority (~15 years ago), as Unicode was relatively new back then.
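If you do want printable non-ASCII characters to survive, a custom function built on isPrint is one option. The following is a minimal sketch, not how base does it: showUnicode is a hypothetical name, and the sketch assumes the output goes to something that can handle UTF-8 (hence the hSetEncoding call).

```haskell
import Data.Char (isPrint, showLitChar)
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical variant of 'show' for Strings (not part of base): keep every
-- character that isPrint accepts, and defer to showLitChar for the rest.
showUnicode :: String -> String
showUnicode s = '"' : go s
  where
    go []          = "\""
    go ('"'  : cs) = "\\\"" ++ go cs   -- the delimiter must still be escaped
    go ('\\' : cs) = "\\\\" ++ go cs   -- so must the escape character itself
    go (c : cs)
      | isPrint c  = c : go cs
      | otherwise  = showLitChar c (go cs)

main :: IO ()
main = do
  hSetEncoding stdout utf8                  -- assumption: UTF-8 capable output
  putStrLn (show "caf\233 \128008")         -- prints "caf\233 \128008"
  putStrLn (showUnicode "caf\233 \128008")  -- keeps the letter and the cat as-is
  putStrLn (showUnicode "a\nb")             -- prints "a\nb"
```

As far as I can tell, read accepts unescaped non-ASCII characters inside a string literal, so the output should still round-trip. The trade-off is the one raised in the comments: the result is no longer pure ASCII.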

typesanitizer answered Nov 08 '22 20:11