tr [:upper:] [:lower:] with Cyrillic text

Tags: shell, unicode, tr

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (in a PuTTY terminal). This is in Cygwin's bash shell, which should (should!) behave identically to Bash on Linux.

What is a portable, reliable way to lowercase Unicode text in a pipe?

asked Nov 14 '12 at 15:11 by slim

1 Answer

This is what I found at Wikipedia (without any reference, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
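To make the single-byte point concrete, here is a small sketch that dumps the UTF-8 bytes of the letter from the question. It assumes the script is saved as UTF-8 and that od prints hex bytes the way GNU od does; the helper name utf8_bytes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: show the bytes of stdin as hex, trimmed of padding.
utf8_bytes() {
    od -An -tx1 | sed 's/^ *//; s/ *$//'
}

printf 'Г' | utf8_bytes   # d0 93  (U+0413, Cyrillic capital GHE)
printf 'г' | utf8_bytes   # d0 b3  (U+0433, Cyrillic small ghe)
```

Each letter is two bytes, so a tr that maps one byte at a time never sees "Г" as a single uppercase character, which would explain why it passes through unchanged.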

Also, this is old but related.

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк
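Note that without the g flag the expression above only rewrites the first match, which is empty if the line does not start with a capital. Carried over to the script in the question, the tr step could be swapped for a whole-line variant. This is only a sketch: it assumes GNU sed (\L is a GNU extension) and that LC_ALL already names an installed UTF-8 locale; the function names to_lower and word_list are mine, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: lowercase stdin using GNU sed's \L extension.
# Matching the whole line (.*) lowercases every word on it.
to_lower() {
    sed 's/.*/\L&/'
}

# The question's word-list pipeline with tr replaced by to_lower.
# Assumes GNU sed and a UTF-8 locale set by the caller.
word_list() {
    sed -re 's/\s+/\n/g' |
    sed 's/[.!,—()«»;:?]//g' |
    to_lower |
    sort -u
}
```

Under a UTF-8 locale, echo 'СТЭК' | to_lower should give the same result as the one-liner above. On a system without GNU sed this will not work as written.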
answered Oct 09 '22 at 23:10 by Lev Levitsky