
Behaviour of tr -c -d while deleting bytes with values that are not characters

Tags: shell, posix, tr

I am having trouble understanding this paragraph from the 'RATIONALE' section of http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tr.html.

The ISO POSIX-2:1993 standard had a -c option that behaved similarly to the -C option, but did not supply functionality equivalent to the -c option specified in POSIX.1-2008. This meant that historical practice of being able to specify tr -cd\000-\177 (which would delete all bytes with the top bit set) would have no effect because, in the C locale, bytes with the values octal 200 to octal 377 are not characters.

However, my test on a CentOS 6.5 system shows that it does have an effect.

$ export LC_ALL=C
$ export LANG=C
$ locale
LANG=C
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_PAPER="C"
LC_NAME="C"
LC_ADDRESS="C"
LC_TELEPHONE="C"
LC_MEASUREMENT="C"
LC_IDENTIFICATION="C"
LC_ALL=C
$ printf "\x41\x42\x81\x82" | od -t x1
0000000 41 42 81 82
0000004
$ printf "\x41\x42\x81\x82" | tr -c -d "\000-\1777" | od -t x1
0000000 41 42
0000002

The command tr -c -d "\000-\1777" did remove the bytes with values \x81 and \x82. Why does the result of my test not agree with what is written in the specification?

Lone Learner asked Oct 19 '22 15:10


1 Answer

Since you’re using CentOS, it’s most likely that your tr command is from the GNU coreutils package. GNU tr doesn’t (yet) make a distinction between the behaviour of -c and -C. In recent versions of tr, both -c and -C are equivalent short options for the --complement option.

According to the GNU documentation for tr:

Currently tr fully supports only single-byte characters. Eventually it will support multibyte characters; when it does, the -C option will cause it to complement the set of characters, whereas -c will cause it to complement the set of values. This distinction will matter only when some values are not characters, and this is possible only in locales using multibyte encodings when the input contains encoding errors.

I also found the quoted paragraph from the POSIX specification confusingly worded, but I’d agree with Etan Reisner’s interpretation that “implementations conforming to the 1993 version of the spec would be broken but earlier implementations (historical) and implementations conforming to the 2008 (and newer) spec would work”.

In any case, GNU tr does not (yet) implement this part of the 2008 POSIX specification (i.e., differentiating between characters and values), so it can’t be used to test this behaviour.
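If you want to check this on your own system, a minimal sketch along the following lines compares -c and -C on the same input. It assumes your tr accepts both options (GNU tr does). The variable names are made up for illustration, and the input bytes are written as octal escapes because \x escapes are not portable across printf implementations.

```shell
# Bytes: 'A', 'B', 0x81, 0x82, written as octal escapes for printf.
input_bytes='\101\102\201\202'

# -c: POSIX.1-2008 says this complements the set of values.
out_lower_c=$(printf "$input_bytes" | LC_ALL=C tr -c -d '\000-\177' | od -An -t x1 | tr -d ' \n')

# -C: POSIX.1-2008 says this complements the set of characters.
out_upper_c=$(printf "$input_bytes" | LC_ALL=C tr -C -d '\000-\177' | od -An -t x1 | tr -d ' \n')

# Report whether the two options behaved the same on this input.
if [ "$out_lower_c" = "$out_upper_c" ]; then
    echo "tr -c and tr -C agree here: $out_lower_c"
else
    echo "tr -c gave $out_lower_c, tr -C gave $out_upper_c"
fi
```

With GNU tr I would expect both results to be identical, since both options map to --complement. A tr that does distinguish characters from values might report a difference.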

By the way, you have a redundant 7 in your tr -c -d "\000-\1777" command.
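An octal escape takes at most three digits, so "\1777" is read as \177 followed by a literal 7. That 7 is already inside the range \000-\177, which is why it is harmless. A quick way to confirm this, as a sketch using a hypothetical helper name and octal escapes for the input bytes:

```shell
# Hypothetical helper: print the bytes read from stdin as compact hex.
hexdump_bytes() {
    od -An -t x1 | tr -d ' \n'
}

# Same four bytes as in the question: 'A', 'B', 0x81, 0x82.
with_extra_7=$(printf '\101\102\201\202' | LC_ALL=C tr -c -d '\000-\1777' | hexdump_bytes)
without_extra_7=$(printf '\101\102\201\202' | LC_ALL=C tr -c -d '\000-\177' | hexdump_bytes)

echo "with extra 7:    $with_extra_7"
echo "without extra 7: $without_extra_7"
```

With GNU tr, both lines should show the same bytes (41 and 42), matching the output in the question.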

Anthony Geoghegan answered Oct 22 '22 23:10