What's the best way to embed a Unicode character in a POSIX shell script?

Question

There are several shell-specific ways to include a ‘unicode literal’ in a string. For instance, in Bash, the quoted string-expanding mechanism, $'', allows us to directly embed an arbitrary character by its codepoint: $'\u2620'.

However, if you're trying to write universally cross-platform shell scripts (in practice, this usually boils down to “runs in Bash, Zsh, and Dash”), that's not a portable feature.

I can portably produce any character in the ASCII table (via octal escapes) with a construct like the following:

WHAT_A_CHARACTER="$(printf '\036')"

… however, POSIX / Dash printf only supports octal escapes.
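To make the limitation concrete, here is a minimal sketch of what the octal-only route looks like for a non-ASCII character. It assumes the script's output is meant to be UTF-8, and the variable name SKULL is only illustrative. The codepoint (U+2620) never appears in the code, only its UTF-8 bytes do:

```shell
# U+2620 encoded as UTF-8 is the three bytes E2 98 A0,
# which are 342 230 240 in octal.
SKULL="$(printf '\342\230\240')"
printf '%s\n' "$SKULL"
```

This runs with any POSIX printf, but it hard-codes one particular encoding and hides the hexadecimal codepoint.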

I can also obviously achieve the full Unicode space by farming the task out to a fuller programming environment:

OH_CAPTAIN_MY_CAPTAIN="$(ruby -e 'print "\u2388"')"
TAKE_ME_OUT_TONIGHT="$(node -e 'console.log("\u266C")')"

So: what's the best way to encode such a character into a shell-script, that:

Works in dash, bash, and zsh,
shows the hexadecimal encoding of the codepoint in the code,
isn't dependent on the particular encoding of the string (i.e. not by encoding UTF-8 bytes in octal)
and finally, doesn't require the invocation of any “heavy” interpreter. (let's say, less than 0.01s runtime.)

rici · Accepted Answer

If you have GNU printf installed (it's in the Debian package coreutils, for example), then you can use it regardless of which shell you are using by avoiding the shell's builtin:

env printf '\u2388\n'

Here I am using the Posix-standard env command to avoid the use of the printf builtin, but if you happen to know where printf is, you could call it directly by its complete path, such as

/usr/bin/printf '\u2388\n'
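Whether the external printf understands \u can be probed at run time. The following is a rough sketch rather than a definitive test: the function name have_unicode_printf is hypothetical, and it assumes that an implementation without \u support echoes the escape back more or less literally (which is also what GNU printf appears to do when the current locale cannot represent the character).

```shell
# have_unicode_printf: hypothetical helper, not part of any standard.
# Succeeds only if the printf found on PATH expands \u2388 into
# something other than the literal escape text.
have_unicode_printf() {
    _out=$(env printf '\u2388' 2>/dev/null) || return 1
    case $_out in
        ''|*u2388*) return 1 ;;   # nothing printed, or escape passed through
        *)          return 0 ;;
    esac
}

if have_unicode_printf; then
    env printf '\u2388\n'
else
    echo 'external printf lacks \u support; a fallback is needed' >&2
fi
```

The result depends on the locale in effect, so the probe should run under the same locale settings as the code that relies on it.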

If both your external printf and your shell's builtin printf only implement the Posix standard, you need to work harder. One possibility is to use iconv to translate to UTF-8, but while the Posix standard requires that there be an iconv command, it does not in any way prescribe the way standard encodings are named. I think the following will work on most Posix-compatible platforms, but the number of subshells created might be sufficient to make it less efficient than a "heavy" script interpreter:

printf $(printf '\\%o' $(printf %08x 0x2388 | sed 's/../0x& /g')) |
iconv -f UTF-32BE -t UTF-8

The above uses the printf builtin to force the hexadecimal codepoint value to be 8 hex digits long, then sed to rewrite them as 4 hex constants, then printf again to change the hex constants into octal notation and finally another printf to interpret the octal character constants into a four-byte sequence which can be fed into iconv as big-endian UTF-32. (It would be simpler with a printf which recognizes \x escape codes, but Posix doesn't require that and dash doesn't implement it.)
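The same pipeline can be unrolled one stage at a time, which makes the intermediate values easier to inspect. This is only an illustrative sketch: the variable names hex and octal are made up here, and set -- overwrites the script's positional parameters. As in the one-liner, it assumes the local iconv accepts the names UTF-32BE and UTF-8.

```shell
# Stage 1: pad the codepoint to exactly 8 hex digits.
hex=$(printf '%08x' 0x2388)              # 00002388

# Stage 2: split into four hex byte constants, one per word.
# An unquoted command substitution is used so that dash, bash and
# zsh all split the result into separate words.
set -- $(printf '%s' "$hex" | sed 's/../0x& /g')   # 0x00 0x00 0x23 0x88

# Stage 3: rewrite each byte as a backslash-octal escape.
octal=$(printf '\\%o' "$@")              # \0\0\43\210

# Stage 4: let printf turn the escapes into raw bytes (big-endian
# UTF-32) and have iconv re-encode them as UTF-8.
printf "$octal" | iconv -f UTF-32BE -t UTF-8
```

The raw bytes are piped straight into iconv rather than stored in a variable, because shell variables cannot hold the NUL bytes that UTF-32 contains.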

You can use the line without modification to print more than one symbol, as long as you provide the Unicode codepoints (as integer constants) for all of them (example executed in dash):

$ printf $(printf '\\%o' $(printf %08x 0x2388 0x266c 0xA |
>                          sed 's/../0x& /g')) |
> iconv -f UTF-32BE -t UTF-8
⎈♬
$
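For use inside a script, the one-liner could be wrapped in a small function so that the hexadecimal codepoints stay visible at the call site. This is a sketch under the same assumptions as the pipeline above (an iconv that recognizes UTF-32BE and UTF-8). The function name unichars and the variable HELM are hypothetical.

```shell
# unichars: hypothetical helper name, not part of the answer's code.
# Usage: unichars CODEPOINT...   (integer constants such as 0x2388)
unichars() {
    # The inner $(...) is left unquoted on purpose: its output must be
    # split into one word per byte. The resulting escape string holds
    # no whitespace, so it is safe to quote it as the outer format.
    printf "$(printf '\\%o' $(printf '%08x' "$@" | sed 's/../0x& /g'))" |
        iconv -f UTF-32BE -t UTF-8
}

HELM=$(unichars 0x2388)
printf '%s\n' "$HELM"
```

Each call still spawns several processes, so for a handful of constants it is probably cheaper to compute them once at the top of the script and reuse the variables.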

Note: As Geoff Nixon mentions in a comment, the fish shell (which is nowhere close to Posix standard, and as far as I can see has no aspirations to conform) will complain about the unquoted %08x format argument to printf, because it expects words starting with % to be jobspecs. So if you use fish, add quotes to the format argument.

Tags: bash, shell, posix, unicode, dash-shell

ELLIOTTCABLE
