I was doing some everyday grepping and suddenly discovered that something seemingly trivial does not work:
$ echo T | grep [A-Z]
No match.
How come T is not within A-Z range?
I changed the regex a tiny bit:
$ echo T | grep [A-Y]
A match!
Whoa! How is T within A-Y but not within A-Z?
Apparently this is because my environment is set to Estonian locale where Y is at the end of the alphabet but Z is somewhere in the middle: ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY
$ echo $LANG
et_EE.UTF-8
This all came as a bit of a shock to me. 99% of the time I grep computer code, not Estonian literature. Have I been using grep the wrong way all the time? What all kind of mistakes have I made because of this in the past?
After trying several things I arrived at the following solution:
$ echo T | LANG=C grep [A-Z]
Is this the recommended way to make grep locale-independent?
Further more... would it be safe to define an alias like that:
$ alias grep="LANG=C grep"
PS. I'm also wondering of why are the character ranges like [A-Z]
locale dependent in the first place while \w
seems to be unaffected by locale (although the manual says \w
is equivalent of [[:alnum:]]
- but I found out the latter depends on locale while \w
does not).
POSIX regular expressions, which Linux and FreeBSD grep support naturally, and some others support on request, have a series of [:xxx:] patterns that honor locales. See the man page for details.
grep '[[:upper:]]'
As the []s are part of the pattern name you need the outer [] as well, regardless of how strange it looks.
With the advent of these : codes the classic \w, etc., remain strictly in the C locale. Thus your choice of patterns determines if grep uses the current locale or not.
[A-Z] should follow locale, but you may need to set LC_ALL rather than LANG, especially if the system sets LC_ALL to a different value for your.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With