
Implications of LC_ALL=C to speed up grep

I just discovered that prefixing my grep commands with LC_ALL=C does wonders for speeding grep up.

But I am wondering about the implications.

Would a pattern using UTF-8 not match? What happens if the grepped file is using UTF-8?

elhoim asked Nov 15 '11 14:11




1 Answer

You don't necessarily need UTF-8 to run into trouble here. The locale is responsible for setting the character classes, i.e. determining which character is a space, a letter or a digit. Consider these two examples:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
false
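
The same effect shows up with UTF-8 input, which is what the question asks about. The following is a minimal sketch, assuming a grep that supports POSIX character classes; the sample word and the variable names are made up for illustration. In the C locale the two bytes of a UTF-8 encoded é are not letters, so a line that looks purely alphabetic no longer matches as a whole:

```shell
# "café" encoded as UTF-8: é is the two bytes 0xC3 0xA9 (octal 303 251).
word=$(printf 'caf\303\251')

# Count lines made up entirely of letters. In the C locale only ASCII
# letters belong to [[:alpha:]], so the é bytes make the line fail.
# "|| true" only hides grep's non-zero exit status when nothing matches.
all_letters=$(printf '%s\n' "$word" | LC_ALL=C grep -c '^[[:alpha:]]*$' || true)

# The ASCII part of the word still matches as before.
ascii_prefix=$(printf '%s\n' "$word" | LC_ALL=C grep -c '^caf' || true)

echo "all letters: $all_letters, ascii prefix: $ascii_prefix"
```

With a UTF-8 locale such as en_US.UTF-8 (if it is installed on the machine), the first count would be expected to be 1 instead of 0.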

When matching exact byte sequences against each other, however, the locale makes no difference:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
ä
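
For the UTF-8 case from the question, a small sketch along the same lines (the sample text and variable names are invented, and the byte-wise behaviour of `.` is what current GNU grep is expected to do, not something guaranteed for every grep): a literal UTF-8 pattern still matches under LC_ALL=C because the bytes are compared as they are, but `.` then stands for one byte instead of one character.

```shell
line=$(printf 'caf\303\251')      # "café" in UTF-8, é = bytes 0xC3 0xA9
e_acute=$(printf '\303\251')      # the é on its own

# Literal UTF-8 pattern: same bytes on both sides, so it matches.
literal=$(printf '%s\n' "$line" | LC_ALL=C grep -c -F "$e_acute" || true)

# "." is a single byte in the C locale, so one dot does not cover é ...
one_dot=$(printf '%s\n' "$line" | LC_ALL=C grep -c '^caf.$' || true)
# ... but two dots do.
two_dots=$(printf '%s\n' "$line" | LC_ALL=C grep -c '^caf..$' || true)

echo "literal=$literal one_dot=$one_dot two_dots=$two_dots"
```

So fixed strings are safe to search for, while patterns that count characters (`.`, `{n}`, bracket expressions containing non-ASCII characters) may behave differently than they would in a UTF-8 locale.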

I'm not sure how thoroughly grep implements Unicode, or how well different code points that represent the same character are matched to each other, but matching any subset of ASCII, and matching single characters that have no alternate binary representation, should work fine regardless of locale.
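
To illustrate what "alternate binary representations" can mean in practice, here is a hedged sketch with two invented cases: an é stored precomposed (U+00E9) versus as e plus a combining accent (U+0301), and an uppercase É searched for with -i. It assumes grep compares bytes without normalizing Unicode, which is the behaviour to expect under LC_ALL=C.

```shell
precomposed=$(printf 'caf\303\251')   # é as U+00E9 -> bytes C3 A9
decomposed=$(printf 'cafe\314\201')   # e + U+0301  -> bytes 65 CC 81

# Both render as "café", but the byte sequences differ, so no match.
norm=$(printf '%s\n' "$precomposed" | LC_ALL=C grep -c -F "$decomposed" || true)

# -i only folds ASCII letters in the C locale: É (C3 89) is not é (C3 A9).
upper=$(printf '\303\211cole')        # "École"
lower=$(printf '\303\251cole')        # "école"
fold=$(printf '%s\n' "$upper" | LC_ALL=C grep -c -i -F "$lower" || true)

# ASCII case folding is unaffected.
ascii_fold=$(printf 'ECOLE\n' | LC_ALL=C grep -c -i 'ecole' || true)

echo "normalized=$norm nonascii_fold=$fold ascii_fold=$ascii_fold"
```

In a UTF-8 locale the -i case would be expected to match, at the cost of the speed that LC_ALL=C buys. The precomposed versus decomposed mismatch is unlikely to be resolved by grep in either locale, so such data would need to be normalized before searching.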

thiton answered Oct 05 '22 04:10
