Lately I have been using Unicode more often, and I wondered whether there is a command-line tool to convert Unicode between its forms.
It would be nice to be able to say:
uni_convert "☃" --string
and learn that the character is named "SNOWMAN" in Unicode.
Perl's Unicode-Tussle distribution comes with the useful uniprops.
$ uniprops '☃'
U+2603 ‹☃› \N{SNOWMAN}
...
$ uniprops 'U+2603'
U+2603 ‹☃› \N{SNOWMAN}
...
$ uniprops 'SNOWMAN'
U+2603 ‹☃› \N{SNOWMAN}
...
If you're writing code, you'll want charnames.
Want | Have | Code |
---|---|---|
$code | $char | ord($char) |
$code | $name | charnames::vianame($name) |
$char | $code | chr($code) |
$char | $name | chr(charnames::vianame($name)) |
$name | $code | charnames::viacode($code) |
$name | $char | charnames::viacode(ord($char)) |
vianame accepts official aliases (e.g. LF for LINE FEED). You'll need to parse U+ notation yourself if you wish to accept it. ($code = hex(s/^U\+//r);)
Example:
use strict;
use warnings;
use feature qw( say );
use experimental qw( regex_sets ); # Safe. Optional since 5.36.
use utf8; # Source encoded using UTF-8.
use open ":std", ":encoding(UTF-8)"; # Terminal provides/expects UTF-8.
use charnames qw( :full );
use Encode qw( decode_utf8 );
@ARGV == 1
or die("usage\n");
my $s = decode_utf8($ARGV[0]);
for my $cp ( unpack "W*", $s ) {
my $ch = chr($cp);
if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) { # Not sure if good enough.
printf "‹%s› ", $ch;
} else {
print "--- ";
}
printf "U+%X ", $cp;
say charnames::viacode($cp);
}
$ uni_id ☃
‹☃› U+2603 SNOWMAN
$ uni_id çà
‹ç› U+E7 LATIN SMALL LETTER C WITH CEDILLA
‹à› U+E0 LATIN SMALL LETTER A WITH GRAVE
Other resources:
Unicode::UCD
Provides access to the information found in the Unicode Character Database.
The Unicode Standard is more than characters and properties.
perluniprops
unichars from Unicode-Tussle (e.g. unichars '\p{Hiragana}')
Here is an awk script that does that.
Download this file from unicode.org that provides the latest names.
Then:
q=$(printf '%04X\n' \'☃) # uppercase, zero-padded hex, to match the code point column of the file
awk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
}
END{ print names[q] }
' q="$q" names.txt
Prints:
SNOWMAN
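The lookup key has to match the file's code point column exactly, and Unicode data files write code points as uppercase hex padded to at least four digits. Here is a minimal sketch of a lookup that also accepts U+ notation. The function name lookup_name and the two-line sample file are hypothetical, and the layout (hex code point, whitespace, name) is an assumption about the downloaded file:

```shell
# Hypothetical stand-in for the downloaded names file. Assumed layout:
# hex code point, whitespace, character name.
names_file=$(mktemp)
printf '00E7\tLATIN SMALL LETTER C WITH CEDILLA\n' >  "$names_file"
printf '2603\tSNOWMAN\n'                           >> "$names_file"

# lookup_name (hypothetical helper): accepts "U+2603", "2603" or "e7".
lookup_name() {
    hex=${1#U+}                      # drop an optional U+ prefix
    key=$(printf '%04X' "0x$hex")    # normalise to uppercase, 4+ digits
    # Concatenating "" forces a string comparison; compared as numbers,
    # keys such as 00E7 and 00E9 would both parse as 0 and wrongly match.
    awk -v q="$key" '($1 "") == (q "") {
        sub(/^[0-9A-Fa-f]+[ \t]+/, "")   # strip the code point column
        print
        exit
    }' "$names_file"
}

lookup_name U+2603   # prints SNOWMAN
lookup_name e7       # prints LATIN SMALL LETTER C WITH CEDILLA
```

An unknown code point prints nothing, so a caller can test for empty output.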
If you want to go the other way:
cp=$(awk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
other_names[str]=$1
}
END{ print other_names[q] }
' q="SNOWMAN" names.txt)
echo -e "\u${cp}"
Prints:
☃
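Note that echo -e with \u is a bash feature, and bash's \u takes at most four hex digits, so code points above U+FFFF need \U or another route. As a sketch of a shell-independent alternative, the UTF-8 bytes can be computed by hand. The function name cp_to_utf8 is hypothetical, and it does no validation (surrogates and values above 10FFFF are not rejected):

```shell
# cp_to_utf8 (hypothetical helper): print the UTF-8 encoding of a
# hex code point such as 2603 or 1F600.
cp_to_utf8() {
    cp=$((0x$1))
    if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
        set -- "$cp"
    elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
        set -- $((192 | cp >> 6)) $((128 | cp & 63))
    elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set -- $((224 | cp >> 12)) $((128 | cp >> 6 & 63)) $((128 | cp & 63))
    else                                  # 4 bytes: 11110xxx then three 10xxxxxx
        set -- $((240 | cp >> 18)) $((128 | cp >> 12 & 63)) \
               $((128 | cp >> 6 & 63)) $((128 | cp & 63))
    fi
    # Emit each byte through an octal escape in the printf format string.
    for byte in "$@"; do
        printf "\\$(printf '%03o' "$byte")"
    done
}

cp_to_utf8 2603; echo    # prints ☃ on a UTF-8 terminal
```

With this in place, the last line of the example above could become cp_to_utf8 "$cp" instead of the echo -e call.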
If you have GNU awk, you can easily convert the hex code point to decimal and print the character from within awk. This allows a single source file to be used in either direction, depending on whether you define q or r:
gawk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' r='SNOWMAN' names.txt
☃
gawk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' q=$(printf '%04X\n' \'☃) names.txt
SNOWMAN
I separated the code into a file and created a repo: https://github.com/poti1/uni_convert