
Why is "grep --ignore-case" 50 times slower?

I was very surprised to see that adding the --ignore-case option to grep can slow a search down by a factor of 50. I've tested this on two different machines with the same result, and I'm curious what explains such a large performance difference.

I would also like to see an alternative to grep for case-insensitive searches. I don't need regular expressions, just fixed-string searching. The test file is a 50 MB plain-text file of dummy data; you can generate it with the following commands:

Create test.txt

yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt

Demonstration

Below is a demonstration of the slowness. Adding the --ignore-case option makes the command 57 times slower.

$ time grep fun test.txt
all work and no plJack is no fun
real    0m0.061s

$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m3.498s

Possible Explanations

Googling around, I found a discussion about grep being slow in UTF-8 locales, so I ran the following test, and it did speed things up. The default locale on my machine is en_US.UTF-8, so setting it to POSIX gives a performance boost, but then of course I can't search Unicode text correctly, which is undesirable. It is also still more than twice as slow as the case-sensitive search.

$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.142s
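
One detail worth checking when repeating the locale test: LANG is only the fallback. Under the POSIX precedence rules, LC_ALL overrides LC_CTYPE, which overrides LANG, so LANG=POSIX has no effect if either of the other two is already set. The sketch below prints the locale that would govern character handling in the current shell, then forces the C locale through LC_ALL. The function name effective_ctype and the file sample.txt are made up for illustration, and -F is added only because no regular expressions are needed here.

```shell
# Print the locale that governs character classification (LC_CTYPE),
# following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'POSIX\n'
    fi
}

# Small stand-in for test.txt so the example runs instantly (hypothetical file).
printf 'Jack is no fun\nJack is no Fun\n' > sample.txt

# Show which locale a plain "grep" would run under in this shell.
effective_ctype

# LC_ALL=C wins over any LANG or LC_CTYPE already in the environment.
# -F = fixed string, -i = ignore case, -c = print the number of matching lines.
count=$(LC_ALL=C grep -F -i -c fun sample.txt)
echo "$count"
```

To time the real case, the same prefix can be put in front of the original command: time LC_ALL=C grep -F -i fun test.txt. The drawback noted above still applies, since the C locale does not handle non-ASCII case folding.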

Alternatives

We could use Perl instead. It is faster than grep --ignore-case in the UTF-8 locale, but still about 6 times slower than the case-sensitive grep, and the POSIX-locale grep above is more than twice as fast as Perl.

$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.388s
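
One workaround not tried above, offered as a sketch and not as a measured fix: spell out both cases of each letter in a bracket expression and run grep without --ignore-case, so the case-insensitive option is never used. The helper name to_bracket_pattern and the file sample.txt are made up for illustration. The helper assumes the search word contains only ASCII letters and digits (regex metacharacters are not escaped). Whether this is actually faster depends on the grep version, so it should be timed against test.txt before relying on it.

```shell
# Turn a word such as "fun" into "[fF][uU][nN]".
# Assumption: the word is plain ASCII letters and digits.
to_bracket_pattern() {
    word=$1
    pattern=
    while [ -n "$word" ]; do
        rest=${word#?}            # everything after the first character
        ch=${word%"$rest"}        # the first character itself
        lower=$(printf '%s' "$ch" | tr '[:upper:]' '[:lower:]')
        upper=$(printf '%s' "$ch" | tr '[:lower:]' '[:upper:]')
        if [ "$lower" != "$upper" ]; then
            pattern="$pattern[$lower$upper]"   # letter: match either case
        else
            pattern="$pattern$ch"              # digit: keep as-is
        fi
        word=$rest
    done
    printf '%s\n' "$pattern"
}

# Small stand-in for test.txt (hypothetical file).
printf 'Jack is no fun\nJack is no Fun\nall work and no play\n' > sample.txt

pattern=$(to_bracket_pattern fun)
echo "$pattern"

# Case-sensitive grep with the expanded pattern; -c prints the number of matching lines.
matches=$(grep -c "$pattern" sample.txt)
echo "$matches"
```

On the real file this would be run as time grep "$(to_bracket_pattern fun)" test.txt and compared with the timings above.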

So I'd love to find a fast, correct alternative, and an explanation if anyone has one.

UPDATE - CentOS

The two machines tested above were both running Ubuntu: one 11.04 (Natty Narwhal), the other 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results: the two cases perform almost identically. CentOS 5.3 was released in January 2009 and is running grep 2.5.1, while Ubuntu 12.04 is running grep 2.10, so the difference may come from changes in the newer grep version or from differences between the two distributions.

$ time grep fun test.txt
Jack is no fun
real    0m0.026s

$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real    0m0.027s
Marwan Alsabbagh asked Dec 11 '12 11:12


2 Answers

I think this bug report helps in understanding why it is slow:

bug report grep, slow on ignore-case

Peter answered Sep 27 '22 23:09


The slowness arises because grep, in a UTF-8 locale, constantly accesses the files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".

It can be shown using the strace utility. Both files are from glibc.
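
A sketch of how such a trace could be collected and summarized. The helper name count_path_hits and the output file grep.trace are made up for illustration. strace's -e trace=file option limits the log to syscalls that take a file name, and -o writes the log to a file. The trace only runs if strace is installed and the test.txt from the question exists. It counts syscalls that name each file, and what the log shows will depend on the grep and glibc versions in use.

```shell
# Count the lines of a strace log that mention a given path.
# $1 = trace file, $2 = path (matched as a fixed string).
count_path_hits() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true" hides that.
    grep -c -F -- "$2" "$1" || true
}

if command -v strace >/dev/null 2>&1 && [ -f test.txt ]; then
    # Trace only file-related syscalls and write them to grep.trace.
    if strace -e trace=file -o grep.trace grep --ignore-case fun test.txt >/dev/null; then
        echo "locale-archive:      $(count_path_hits grep.trace /usr/lib/locale/locale-archive)"
        echo "gconv-modules.cache: $(count_path_hits grep.trace /usr/lib/gconv/gconv-modules.cache)"
    fi
else
    echo "strace or test.txt not available; nothing traced" >&2
fi
```

Running the same trace without --ignore-case, or with LC_ALL=C, gives a baseline to compare the counts against.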

Marat Buharov answered Sep 27 '22 22:09