I was very surprised to see that adding the --ignore-case option to grep can slow down a search by a factor of 50. I've tested this on two different machines with the same result. I am curious to find an explanation for the huge performance difference.
I would also like to see an alternative command to grep for case-insensitive searches. I don't need regular expressions, just fixed-string searching. The test file is a 50 MB plain text file with some dummy data; you can use the following commands to generate it:
Create test.txt
yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt
Demonstration
Below is a demonstration of the slowness. Adding the --ignore-case option makes the command 57 times slower.
$ time grep fun test.txt
all work and no plJack is no fun
real 0m0.061s
$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m3.498s
Possible Explanations
Googling around, I found a discussion about grep being slow in a UTF-8 locale, so I ran the following test, and it did speed things up. The default locale on my machine is en_US.UTF-8, so setting it to POSIX gives a large performance boost, but of course I can no longer search correctly on Unicode text, which is undesirable. It is also still about 2.3 times slower than the case-sensitive search (0.142s versus 0.061s).
$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.142s
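A side note on the test above: LANG is only a fallback, so it has no effect when LC_ALL or LC_CTYPE is already set in the environment. The following is a minimal sketch of a slightly more robust variant, assuming GNU grep. It sets LC_ALL=C for the single command and uses --fixed-strings, since no regular expressions are needed here. The function name ascii_igrep is made up for illustration, and the same caveat applies: case folding is ASCII-only, so it will not match non-ASCII case variants.

```shell
# Hypothetical helper: case-insensitive fixed-string search in the C locale.
# $1 = literal string to search for, $2 = file to search.
# Assumption: ASCII-only case folding is acceptable for the data.
ascii_igrep() {
    # LC_ALL overrides LANG and every LC_* variable, for this command only.
    LC_ALL=C grep --fixed-strings --ignore-case -- "$1" "$2"
}
```

It can be timed the same way as the other commands, for example with `time ascii_igrep fun test.txt`.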
Alternatives
We could use Perl instead. It is faster than grep --ignore-case in the UTF-8 locale, but still about 6 times slower than the case-sensitive grep (0.388s versus 0.061s), and the POSIX-locale grep above is more than twice as fast as Perl.
$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.388s
So I'd love to find a fast, correct alternative and an explanation, if anyone has one.
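One workaround that may be worth measuring is to drop --ignore-case entirely and spell out both cases of each letter in a bracket expression, so that fun becomes [fF][uU][nN]. The sketch below builds such a pattern in plain shell. The function name ci_pattern is hypothetical, it only handles ASCII letters, and it assumes the search string contains no regex metacharacters. Whether this is actually faster depends on the grep version, so it needs to be timed on the machine in question.

```shell
# Hypothetical helper: turn an ASCII word into a bracket-expression pattern
# that matches it in either case, e.g. fun -> [fF][uU][nN].
# Assumptions: ASCII input only, no regex metacharacters in the input.
ci_pattern() {
    word=$1
    pattern=''
    while [ -n "$word" ]; do
        rest=${word#?}          # everything after the first character
        ch=${word%"$rest"}      # the first character itself
        case $ch in
            [a-zA-Z])
                lower=$(printf '%s' "$ch" | tr 'A-Z' 'a-z')
                upper=$(printf '%s' "$ch" | tr 'a-z' 'A-Z')
                pattern="${pattern}[${lower}${upper}]"
                ;;
            *)
                # Non-letters are copied through unchanged.
                pattern="${pattern}${ch}"
                ;;
        esac
        word=$rest
    done
    printf '%s\n' "$pattern"
}
```

A case-sensitive search with the generated pattern would then look like `time grep "$(ci_pattern fun)" test.txt`.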
UPDATE - CentOS
The two machines tested above were both running Ubuntu: one 11.04 (Natty Narwhal), the other 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results: the performance of the two cases is almost identical. CentOS 5.3 was released in January 2009 and is running grep 2.5.1, while Ubuntu 12.04 is running grep 2.10, so the difference may come from changes in the newer grep version or from differences between the two distributions.
$ time grep fun test.txt
Jack is no fun
real 0m0.026s
$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real 0m0.027s
I think this bug report helps in understanding why it is slow:
bug report grep, slow on ignore-case
This slowness is due to grep (in a UTF-8 locale) constantly accessing the files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".
This can be shown using the strace utility. Both files are from glibc.
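As a rough sketch of how that check could be done, assuming strace is installed and tracing is permitted: record the file-related system calls to a log, then count the log lines that mention the two glibc files. The function names trace_grep and count_locale_access and the log file name are made up for illustration.

```shell
# Hypothetical helper: run a case-insensitive grep under strace and write
# the file-related system calls to a log.
# $1 = search string, $2 = file to search, $3 = log file to write.
trace_grep() {
    strace -e trace=file -o "$3" grep --ignore-case -- "$1" "$2" > /dev/null
}

# Hypothetical helper: count the log lines that mention either glibc file.
# $1 = log file written by trace_grep.
count_locale_access() {
    grep -c -E 'locale-archive|gconv-modules\.cache' "$1"
}
```

For the test file above, that would be `trace_grep fun test.txt grep-i.trace` followed by `count_locale_access grep-i.trace`. Repeating the trace without --ignore-case gives a baseline count to compare against.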