Consider the following examples (λ> = ghci, $ = shell):
λ> writeFile "d" $ show "d"
$ cat d
"d"
λ> writeFile "d" "d"
$ cat d
d
λ> writeFile "backslash" $ show "\\"
$ cat backslash
"\\"
λ> writeFile "backslash" "\\"
$ cat backslash
\
λ> writeFile "cat" $ show "🐈" -- U+1F408
$ cat cat
"\128008"
λ> writeFile "cat" "🐈"
$ cat cat
🐈
I understand that "\128008" is just another way of representing "🐈" in Haskell source code.
My question is: why does the "🐈" example behave like the backslash instead of like "d"? Since it is a printable character, shouldn't it behave like a letter?
More generally, what is the rule to determine whether the character will be shown as a printable character or as an escape code? I looked at Section 6.3 in the Haskell 2010 Language report but it doesn't specify the exact behaviour.
Printable ASCII characters are shown as graphic characters.* Everything else will be escaped.
* Except for double quotes (as they're used for string delimiters) and backslashes (because they're needed for escaping).
Let's have a look at the source code to figure this one out!
Since we have String = [Char], we should hunt for instance Show Char in the source. It can be found here.
It is defined as:
-- | @since 2.01
instance Show Char where
    showsPrec _ '\'' = showString "'\\''"
    showsPrec _ c    = showChar '\'' . showLitChar c . showChar '\''
    showList cs = showChar '"' . showLitString cs . showChar '"'
So showing a String (using showList) is basically a wrapper around showLitString, and showing a Char is a wrapper around showLitChar.
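To see the two code paths of this instance side by side, here is a small check I'd use (just a sketch; the main wrapper is only there for illustration). It compares show on a Char, which goes through showsPrec, with show on a String, which goes through showList:

```haskell
-- Compare the Char path (showsPrec) with the String path (showList).
main :: IO ()
main = do
  putStrLn (show 'd')   -- prints 'd'
  putStrLn (show "d")   -- prints "d"
  putStrLn (show '\'')  -- prints '\''  (special-cased in showsPrec)
  putStrLn (show "'")   -- prints "'"   (a single quote needs no escape in a string)
  putStrLn (show '"')   -- prints '"'   (a double quote needs no escape in a char)
  putStrLn (show "\"")  -- prints "\""  (escaped by showLitString)
```

Note how each kind of quote is only escaped inside the literal that uses it as a delimiter.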
Let's look at those functions.
showLitString :: String -> ShowS
-- | Same as 'showLitChar', but for strings
-- It converts the string to a string using Haskell escape conventions
-- for non-printable characters. Does not add double-quotes around the
-- whole thing; the caller should do that.
-- The main difference from showLitChar (apart from the fact that the
-- argument is a string not a list) is that we must escape double-quotes
showLitString [] s = s
showLitString ('"' : cs) s = showString "\\\"" (showLitString cs s)
showLitString (c : cs) s = showLitChar c (showLitString cs s)
-- [explanatory comments ...]
As you might've expected, showLitString is mostly a wrapper around showLitChar.
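If you want to poke at it directly, showLitString is exported from GHC.Show. A minimal sketch (unquoted is a name I made up for illustration) showing that it escapes like show does but leaves the surrounding double quotes to the caller:

```haskell
import GHC.Show (showLitString)

-- Hypothetical helper: escape a string the way show does,
-- but without the surrounding double quotes.
unquoted :: String -> String
unquoted str = showLitString str ""

main :: IO ()
main = do
  putStrLn (unquoted "say \"hi\"")  -- prints say \"hi\"
  putStrLn (show "say \"hi\"")      -- prints "say \"hi\""
```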
[Note: If you're unfamiliar with the ShowS type, this is a good answer to understand why it might be useful.]
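For a quick feel of ShowS without following the link, here is a tiny sketch (pieces is a hypothetical name, not something from base). ShowS is just a synonym for String -> String, so the small show functions compose with (.) and each one prepends its output to whatever comes after it:

```haskell
-- ShowS is a synonym for String -> String: a function that prepends
-- its output to the string it is given.
pieces :: ShowS
pieces = showString "ab" . showChar 'c' . shows (42 :: Int)

main :: IO ()
main = putStrLn (pieces "!")  -- prints abc42!
```

This is the same style used in the instance above, where showChar and showLitString are chained together.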
Not quite what we were looking for, so let us go to showLitChar (I've omitted parts of the definition which aren't relevant to the question).
-- | Convert a character to a string using only printable characters,
-- using Haskell source-language escape conventions. For example:
-- [...]
showLitChar :: Char -> ShowS
showLitChar c s | c > '\DEL' = showChar '\\' (protectEsc isDec (shows (ord c)) s)
-- ^ Pattern matched for cat
showLitChar '\DEL' s = showString "\\DEL" s
showLitChar '\\' s = showString "\\\\" s
-- ^ Pattern matched for backslash
showLitChar c s | c >= ' ' = showChar c s
-- ^ Pattern matched for d
-- Some more escape codes
showLitChar '\a' s = showString "\\a" s
-- similarly for '\b', '\f', '\n', '\r', '\t', '\v' etc.
-- showLitChar ... = ...
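Now you see where the problem is. ord c is an Int, so shows (ord c) renders the code point as a decimal number, and this first clause is taken for every character above '\DEL' (ord '\DEL' == 127), i.e. for all non-ASCII characters.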
For characters in the ASCII range, the printable characters are printed and
the rest are escaped. For characters outside it, all of them are escaped.
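The rule can be written down as a predicate and checked against show itself. This is only a sketch: isEscapedInString, shownBody and agreesWithShow are names I made up, and the predicate is my reading of the clauses above, not something taken from base:

```haskell
-- Hypothetical predicate: True when show escapes this character
-- inside a string literal (my reading of the showLitChar clauses).
isEscapedInString :: Char -> Bool
isEscapedInString c = c < ' ' || c >= '\DEL' || c == '\\' || c == '"'

-- What show actually produces for a one-character string,
-- with the surrounding double quotes stripped off.
shownBody :: Char -> String
shownBody c = init (drop 1 (show [c]))

-- A character is left alone exactly when its shown body is itself.
agreesWithShow :: Char -> Bool
agreesWithShow c = isEscapedInString c == (shownBody c /= [c])

main :: IO ()
main = print (all agreesWithShow [minBound .. maxBound])  -- prints True
```

Note that this covers the String case; for a lone Char the single quote is escaped instead of the double quote.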
The code doesn't answer the "why" part of the question. The answer to that (I think) is in the very first comment that we saw:
-- | @since 2.01
instance Show Char where
My first guess was that this behaviour has been kept around to maintain backwards compatibility. But I don't need to guess: see the comments for some good answers to this.
We can do a git blame online using GHC's GitHub mirror ;). Let's see when this code was written (blame link).
The relevant commit is 15 years old (!). However, it does mention Unicode.
The functionality to distinguish between different types of Unicode characters is present in the Data.Char module. Looking at the source:
isPrint c = iswprint (ord c) /= 0
foreign import ccall unsafe "u_iswprint"
iswprint :: Int -> Int
If you trace the commit which introduced iswprint, you'll land up here. That commit was made 13 years ago.
Maybe there was sufficient code written in those two years which they didn't
want to break? I don't know. If some GHC developer could shed more light on this,
that'd be awesome :). Daniel Wagner and Paul Johnson in the comments have pointed out a very good reason for this - operating with non-Unicode systems must've been a high priority (~15 years ago) as Unicode was relatively new back then.
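For what it's worth, here is a rough sketch of what an isPrint-aware variant could look like. To be clear, showUnicode is a hypothetical function of my own, not something in base: it keeps printable non-ASCII characters as they are and defers to showLitChar for everything else.

```haskell
import Data.Char (isPrint, showLitChar)

-- Hypothetical alternative to show for strings: printable non-ASCII
-- characters are kept as-is, everything else is escaped as usual.
showUnicode :: String -> String
showUnicode str = '"' : foldr step "\"" str
  where
    step :: Char -> ShowS
    -- Double quotes must be escaped inside a string literal.
    step '"' rest = showString "\\\"" rest
    step c rest
      | c > '\DEL' && isPrint c = c : rest
      -- Passing the rest of the output along lets showLitChar insert
      -- "\&" when a numeric escape is followed by a digit.
      | otherwise               = showLitChar c rest

main :: IO ()
main = putStrLn (showUnicode "\128008 says \"meow\"\n")
```

Whether the cat then actually shows up correctly depends on the encoding of your terminal or output handle, which is the kind of problem pointed out in the comments.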