I just discovered that prefixing my grep commands with LC_ALL=C does wonders for grep's speed, but I am wondering about the implications. Would a pattern containing UTF-8 characters fail to match? What happens if the file being searched is encoded in UTF-8?

Implications of LC_ALL=C to speedup grep

1 Answer

You don't necessarily need UTF-8 to run into trouble here. The locale is responsible for setting the character classes, i.e. determining which character is a space, a letter or a digit. Consider these two examples:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep '[[:alnum:]]' || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep '[[:alnum:]]' || echo false
false
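The first command only behaves as shown if the en_US.iso88591 locale is actually installed; when the requested locale is missing, the setting generally cannot take effect and the result can look like the C case. A minimal sketch for checking this, assuming the `locale` utility is available (`have_locale` is a hypothetical helper name used for illustration, not a standard command):

```shell
# have_locale NAME: succeed if NAME is listed by `locale -a`.
# The name must be spelled the way the system lists it, e.g. en_US.iso88591.
have_locale() {
    locale -a 2>/dev/null | LC_ALL=C grep -qixF -- "$1"
}

if have_locale en_US.iso88591; then
    echo "en_US.iso88591 is installed"
else
    echo "en_US.iso88591 is missing; pick another entry from 'locale -a'"
fi
```

If the locale is missing, any other installed ISO 8859-1 locale should serve equally well for the comparison.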

When you match an exact byte sequence against itself, however, the locale makes no difference:

$ echo -e '\xe4' | LC_ALL=en_US.iso88591 grep "$(echo -e '\xe4')" || echo false
ä
$ echo -e '\xe4' | LC_ALL=C grep "$(echo -e '\xe4')" || echo false
ä

I'm not sure how far grep's Unicode support goes, or how well equivalent code points are matched against each other, but matching any subset of ASCII, and matching single characters that have no alternate binary representation, should work fine regardless of locale.
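As for UTF-8 input specifically: in the C locale grep works on single bytes, so a multibyte character is seen as several characters. The following is a small sketch of that effect, not a complete account of grep's behaviour; the variable names are made up for illustration, and it assumes a grep that supports the common (non-POSIX) `-o` option:

```shell
# 'ä' encoded as UTF-8 is the two bytes 0xC3 0xA4 (octal 303 244).
# printf with octal escapes is used because it is portable across POSIX shells.
utf8_a_umlaut=$(printf '\303\244')

# In the C locale every byte counts as one character, so '.' matches once per byte.
bytes_seen=$(printf '%s\n' "$utf8_a_umlaut" | LC_ALL=C grep -o . | wc -l | tr -d ' ')
echo "matches for '.' in the C locale: $bytes_seen"

# A pattern that allows exactly one character between x and y therefore
# does not match here, because grep sees two characters between them.
if printf 'x%sy\n' "$utf8_a_umlaut" | LC_ALL=C grep -q '^x.y$'; then
    one_char=yes
else
    one_char=no
fi
echo "'^x.y\$' matches in the C locale: $one_char"
```

With a current GNU grep this should report 2 matches and "no". A literal search for the same byte sequence still succeeds, which is consistent with the second example above.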

145

answered Oct 05 '22 04:10

thiton

elhoim
