I'm trying to extract a word list from a Russian short story.
#!/bin/sh
export LC_ALL=ru_RU.utf8
sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq
However, the tr step is not lowercasing the Cyrillic capital letters. I thought I was being clever using the portable character classes!
$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г
In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (a PuTTY terminal). This is in Cygwin's bash shell -- it should work identically to Bash on Linux (should!).
What is a portable, reliable way to lowercase Unicode text in a pipe?
To convert between lower case and upper case, the predefined character classes in tr can be used. The [:lower:] class matches any lowercase character and the [:upper:] class matches any uppercase character, so giving one as the source set and the other as the target set translates a string from one case to the other.
tr is short for “translate”. The GNU implementation is part of the coreutils package, so it is available on virtually every Linux distribution. The tr command reads a byte stream from standard input (stdin), translates or deletes characters, then writes the result to standard output (stdout).
Use the tr command to convert incoming text, words, or variable data from upper case to lower case or vice versa. Bash version 4.
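For plain ASCII input, the character-class translation described above can be sketched like this. The function name to_lower_ascii is made up for the example. The classes are quoted so the shell does not try to expand them as glob patterns.

```shell
#!/bin/sh
# Hypothetical helper: lowercase ASCII text arriving on stdin.
# The quotes stop the shell from treating [:upper:] as a glob pattern.
to_lower_ascii() {
    tr '[:upper:]' '[:lower:]'
}

printf 'Hello World\n' | to_lower_ascii
```

This only covers the ASCII case. It says nothing about how a given tr handles multibyte input.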
This is what I found at Wikipedia (without any reference, though):
Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
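To see why a single-byte tool struggles here, it may help to count the bytes behind one Cyrillic letter. The sketch below assumes the script file and terminal are UTF-8. The helper name byte_count is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: print how many bytes a string occupies.
# tr -d ' ' strips the padding that some wc implementations add.
byte_count() {
    printf '%s' "$1" | wc -c | tr -d ' '
}

byte_count 'Г'   # Cyrillic capital Ghe, U+0413: two bytes (D0 93) in UTF-8
byte_count 'G'   # Latin capital G: one byte
```

A tr that looks at one byte at a time never sees the two-byte sequence as a single uppercase character, so it passes both bytes through unchanged.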
Also, this is old but related.
As I mentioned in the comment, sed seems to work (GNU sed, at least):
$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк