I just discovered that prefixing my grep commands with LC_ALL=C does wonders for speeding grep up.
But I am wondering about the implications.
Would a pattern using UTF-8 not match? What happens if the grepped file is using UTF-8?
Here are a few options: 1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8. 2) Use fgrep (or its modern equivalent, grep -F) if you're searching for a fixed string, not a regular expression. 3) Remove the -i option if you don't need it.
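The three options above can be tried side by side. This is only a minimal sketch: the temporary file and the search word `needle` are made up for illustration, and the variable names are hypothetical. On a file this small there is no measurable speed difference; wrapping each command in `time` on a large file is where the effect would show.

```shell
# Hypothetical sample data; substitute any large text file to see timing effects.
tmpfile=$(mktemp)
printf 'alpha\nneedle in a haystack\nNeedle again\n' > "$tmpfile"

# 1) C locale: grep compares raw bytes and skips multibyte decoding.
c_locale_hits=$(LC_ALL=C grep -c 'needle' "$tmpfile")

# 2) Fixed-string search: -F treats the pattern as a literal, not a regex.
fixed_hits=$(LC_ALL=C grep -F -c 'needle' "$tmpfile")

# 3) For comparison, -i also matches "Needle", at the cost of case folding.
nocase_hits=$(LC_ALL=C grep -i -c 'needle' "$tmpfile")

echo "$c_locale_hits $fixed_hits $nocase_hits"
rm -f "$tmpfile"
```

Note that `-c` counts matching lines, not individual matches, so the case-insensitive run reports one more line than the other two.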
LC_ALL and the sort command: setting LC_ALL to the special value "C" is a simple yet powerful way to force programs to use the default (POSIX) locale, in which sort compares strings byte by byte instead of applying language-specific collation rules.
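A small sketch of what byte-wise sorting means in practice, assuming ASCII input (the four sample letters and the variable name are made up for illustration). In the C locale every uppercase ASCII letter sorts before every lowercase one, because their byte values are smaller:

```shell
# Sort four letters by raw byte value; 'A' (0x41) and 'B' (0x42)
# come before 'a' (0x61) and 'b' (0x62).
sorted_c=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort | tr -d '\n')
echo "$sorted_c"   # prints ABab
```

Under a language locale such as en_US.UTF-8 the order is usually different, with upper and lower case interleaved, but the exact result depends on the collation tables installed on the system.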
The value 'LC_ALL=C' is essentially an English-only environment that specifies the ANSI C locale. Some language settings for LC_ALL are "ja" for Japanese and "us" for US English, for example 'LC_ALL=ja'. You may find you need to set this to get RIM to work.
You don't necessarily need UTF-8 to run into trouble here. The locale is responsible for setting the character classes, i.e. determining which character is a space, a letter or a digit. Consider these two examples:
$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
false
When trying to match exact binary patterns against each other, the locale doesn't make a difference, however:
$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
ä
I'm not sure how completely grep implements Unicode, or how well different code points are matched to each other, but matching any subset of ASCII, and matching single characters that have no alternate binary representation, should work fine regardless of locale.
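To tie this back to the original question about UTF-8 files, here is a minimal sketch, assuming the input line is the UTF-8 encoding of "café" (where é is the two bytes 0xC3 0xA9). The variable names are hypothetical. In the C locale a literal byte sequence still matches, but `.` stands for a single byte rather than a whole character:

```shell
# Literal search for the exact UTF-8 bytes of "café": matches in the C locale.
literal_hits=$(printf 'caf\303\251\n' | LC_ALL=C grep -F -c "$(printf 'caf\303\251')" || true)

# 'caf.$' wants exactly one character after "caf"; in the C locale that is
# one byte, so the two-byte é does not fit and the line is not counted.
dot_hits=$(printf 'caf\303\251\n' | LC_ALL=C grep -c 'caf.$' || true)

echo "literal=$literal_hits dot=$dot_hits"
```

The `|| true` is only there because grep exits with status 1 when nothing matches. In a UTF-8 locale the second pattern would be expected to match, since é then counts as one character.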