Grepping a huge file (80GB) any way to speed it up?

Tags:

grep

bash


Here are a few options:

1) Prefix your grep command with LC_ALL=C to use the C locale instead of UTF-8.

2) Use fgrep (equivalent to grep -F) because you're searching for a fixed string, not a regular expression.

3) Remove the -i option, if you don't need it.

So your command becomes:

LC_ALL=C fgrep -A 5 -B 5 'db_pd.Clients' eightygigsfile.sql

It will also be faster if you copy your file to a RAM disk, provided you have enough memory to hold it.
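Before switching commands, it can be worth checking that the faster form finds the same lines as the original one. The following is only a sketch: it builds a tiny stand-in file (sample.sql and the variable names are made up for illustration) and compares match counts for the two styles.

```shell
# Hypothetical stand-in for the real dump; swap in your own file to measure.
workdir=$(mktemp -d)
sample="$workdir/sample.sql"
printf '%s\n' \
  'INSERT INTO db_pd.Orders VALUES (1);' \
  'INSERT INTO db_pd.Clients VALUES (2);' \
  'INSERT INTO db_pd.Orders VALUES (3);' > "$sample"

# Locale-aware, case-insensitive regex search (the slower form).
slow_count=$(grep -i -c 'db_pd.Clients' "$sample")

# C locale, fixed string, case-sensitive (the form suggested above).
fast_count=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

echo "slow=$slow_count fast=$fast_count"
rm -rf "$workdir"
```

On the real file, prefixing each grep with bash's time keyword should show how much the locale and -F changes actually buy on your machine. Note that the counts only agree if the data contains no differently-cased variants of the string.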


If you have a multicore CPU, I would really recommend GNU parallel. To grep a big file in parallel use:

< eightygigsfile.sql parallel --pipe grep -i -C 5 'db_pd.Clients'

Depending on your disks and CPUs it may be faster to read larger blocks:

< eightygigsfile.sql parallel --pipe --block 10M grep -i -C 5 'db_pd.Clients'
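As a rough sanity check, the sketch below compares a plain grep with the parallel --pipe form on a generated file. The file contents and variable names are invented for illustration, and the parallel part only runs if GNU parallel appears to be installed. It adds -k so output stays in input order, and it leaves out -C 5 because each chunk is searched by a separate grep, so context lines cannot cross a chunk boundary.

```shell
# Generate a hypothetical ~2 MB sample with exactly two matching lines.
sample=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 60000; i++) {
    table = (i % 30000 == 0) ? "Clients" : "Orders"
    printf "INSERT INTO db_pd.%s VALUES (%d);\n", table, i
  }
}' > "$sample"

# Baseline: a single grep process.
serial_hits=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

# Parallel: split stdin into ~1 MB chunks, one grep per chunk, keep order.
parallel_hits=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
  parallel_hits=$(LC_ALL=C parallel --pipe -k --block 1M \
    grep -F 'db_pd.Clients' < "$sample" | wc -l | tr -d ' ')
fi

echo "serial=$serial_hits parallel=${parallel_hits:-skipped}"
rm -f "$sample"
```

If both numbers agree on a sample, the same command shape can be pointed at the real file, with the block size tuned as described above.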

It's not entirely clear from your question, but other options for grep include:

  • Dropping the -i flag.
  • Using the -F flag for a fixed string
  • Disabling NLS with LANG=C
  • Setting a max number of matches with the -m flag.
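To illustrate the -m flag from the list above, here is a small sketch on a made-up three-line file (the file contents and variable names are assumptions). With -m 1, grep stops reading after the first matching line, which helps a lot when the match is expected near the start of a huge file.

```shell
# Hypothetical sample: every line contains the search string.
sample=$(mktemp)
printf 'db_pd.Clients row %s\n' 1 2 3 > "$sample"

# Without -m, grep scans the whole file.
all_hits=$(LC_ALL=C grep -F -c 'db_pd.Clients' "$sample")

# With -m 1, grep stops after the first matching line.
first_only=$(LC_ALL=C grep -F -m 1 'db_pd.Clients' "$sample")

echo "all=$all_hits first='$first_only'"
rm -f "$sample"
```

LC_ALL is used here rather than LANG because LC_ALL takes precedence over LANG and the other LC_* variables if any of them are already set.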

Some trivial improvement:

  • Remove the -i option if you can; case-insensitive matching is quite slow.

  • Replace the . with \.

    An unescaped dot is the regex symbol for "match any character", which is also slow.
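The difference between the unescaped and escaped dot is about correctness as well as speed. A minimal sketch on a made-up two-line file (the second line is deliberately a near-miss):

```shell
# Hypothetical sample: one real match and one near-miss with X instead of a dot.
sample=$(mktemp)
printf '%s\n' 'USE db_pd.Clients;' 'USE db_pdXClients;' > "$sample"

regex_hits=$(grep -c 'db_pd.Clients' "$sample")     # unescaped dot matches any character
escaped_hits=$(grep -c 'db_pd\.Clients' "$sample")  # escaped dot matches only a literal dot
fixed_hits=$(grep -F -c 'db_pd.Clients' "$sample")  # -F treats the whole pattern literally

echo "regex=$regex_hits escaped=$escaped_hits fixed=$fixed_hits"
rm -f "$sample"
```

This prints regex=2 escaped=1 fixed=1: the unescaped pattern also picks up the near-miss line, while the escaped pattern and -F agree with each other.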


Two lines of attack:

  • Are you sure you need the -i, or is there a way to get rid of it?
  • Do you have more cores to play with? grep is single-threaded, so you might want to start more of them at different offsets.

< eightygigsfile.sql parallel -k -j120% -n10 -m grep -F -i -C 5 'db_pd.Clients'  

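If you need to search for multiple strings, grep -f strings.txt saves a ton of time. The above is a translation of something that I am currently testing; the -j and -n option values seemed to work best for my use case. The -F flag for grep also made a big difference. Note that without --pipe, parallel treats each line on stdin as an argument for grep rather than as data to search, so this form suits a list of file names; to search the contents of a single file, use --pipe as in the parallel commands above.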
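A small sketch of the grep -f approach, using made-up file names and contents. strings.txt holds one fixed string per line, and -F tells grep to treat each of them literally:

```shell
# Hypothetical pattern list and data file.
workdir=$(mktemp -d)
printf '%s\n' 'db_pd.Clients' 'db_pd.Invoices' > "$workdir/strings.txt"
printf '%s\n' \
  'INSERT INTO db_pd.Clients VALUES (1);' \
  'INSERT INTO db_pd.Orders VALUES (2);' \
  'INSERT INTO db_pd.Invoices VALUES (3);' > "$workdir/sample.sql"

# One pass over the data, matching any string from the list.
multi_hits=$(LC_ALL=C grep -F -c -f "$workdir/strings.txt" "$workdir/sample.sql")

echo "lines matching any pattern: $multi_hits"
rm -rf "$workdir"
```

One thing to watch for: an empty line in strings.txt acts as an empty pattern, which matches every line of the input.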