tr [:upper:] [:lower:] with Cyrillic text

Tags: shell, unicode, tr

I'm trying to extract a word list from a Russian short story.

#!/bin/sh

export LC_ALL=ru_RU.utf8

sed -re 's/\s+/\n/g' | \
sed 's/[\.!,—()«»;:?]//g' | \
tr '[:upper:]' '[:lower:]' | \
sort | uniq

However, the tr step does not lowercase the Cyrillic capital letters. I thought I was being clever by using the portable character classes!

$ LC_ALL=ru_RU.utf8 echo "Г" | tr [:upper:] [:lower:]
Г

In case it's relevant, I obtained the Russian text by copy-pasting from a Chrome browser window into Vim. It looks right on screen (in a PuTTY terminal). This is Cygwin's Bash shell, which should (should!) work identically to Bash on Linux.

What is a portable, reliable way to lowercase Unicode text in a pipe?

asked Nov 14 '12 15:11

slim

1 Answer

This is what I found on Wikipedia (unreferenced, though):

Most versions of tr, including GNU tr and classic Unix tr, operate on single-byte characters and are not Unicode compliant. An exception is the Heirloom Toolchest implementation, which provides basic Unicode support.
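To see why a byte-oriented tr would leave "Г" alone, it helps to look at the bytes involved. This is only a small sketch: utf8_bytes is a made-up helper name, and it assumes the script and terminal are using UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of its argument as hex digits.
utf8_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_bytes 'G'   # Latin capital G: a single byte (47)
utf8_bytes 'Г'   # Cyrillic capital Ghe: two bytes (d093)
utf8_bytes 'г'   # Cyrillic small ghe: two bytes (d0b3)
```

A tr that translates one byte at a time sees d0 and 93 as two separate characters, and as far as I can tell neither one counts as an uppercase letter on its own, so nothing gets translated.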

Also, this is old but related.

As I mentioned in the comment, sed seems to work (GNU sed, at least):

$ echo 'СТЭК' | sed 's/[[:upper:]]*/\L&/'
стэк

answered Oct 09 '22 23:10

Lev Levitsky
