grep - how to output progress bar or status

Tags:

grep

bash

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).

I know this is not trivial because grep writes its search results to STDOUT. My usual workflow is to redirect the results to a file, and I would like the progress bar/status to be written to STDOUT or STDERR.

Would this require modifying source code of grep?

Ideal command is:

grep -e "STRING" --results="FILE.txt"

and the progress:

[curr file being searched], number x/total number of files

written to STDOUT or STDERR

asked Jun 07 '16 15:06

Bob

1 Answer

This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.

If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it has explored the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).

In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get the filesize) would make the progress report more exact but add an additional cost to process startup.
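The size-based variant described above could be sketched roughly as follows. This is only an illustration under some assumptions: it relies on GNU find's -printf to get each file's size in the same pass that builds the list (rather than calling stat per file), and grep_by_size and its arguments are hypothetical names, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: batch files by cumulative size and report bytes scanned so far.
# Assumes GNU find (-printf) and GNU grep; grep_by_size is a hypothetical name.

grep_by_size() {
  local pattern=$1 dir=$2 results=$3 limit=${4:-1048576}
  local -a sizes=() paths=() batch=()
  local rec size total=0 done_bytes=0 batch_bytes=0 i

  # One pass over the tree: each NUL-terminated record is "<size> <path>".
  while IFS= read -r -d '' rec; do
    size=${rec%% *}
    sizes+=("$size")
    paths+=("${rec#* }")
    total=$(( total + size ))
  done < <(find "$dir" -type f -printf '%s %p\0')

  : > "$results"
  for i in "${!paths[@]}"; do
    batch+=("${paths[i]}")
    batch_bytes=$(( batch_bytes + sizes[i] ))
    # Flush when the batch is large enough or this is the last file.
    if (( batch_bytes >= limit || i == ${#paths[@]} - 1 )); then
      # Exit status 1 only means "no match in this batch", so ignore it.
      grep -H -e "$pattern" -- "${batch[@]}" >> "$results" || :
      done_bytes=$(( done_bytes + batch_bytes ))
      echo "$done_bytes/$total bytes" >&2
      batch=() batch_bytes=0
    fi
  done
}
```

It might be called as grep_by_size "STRING" . /tmp/FILE.txt 10485760 for batches of roughly 10 MB. The results file is best kept outside the searched directory, and a very large limit combined with many tiny files could exceed the system's command-line length limit.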

One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.

In broad terms, here is a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize):

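# Requires bash 4 and GNU grep; set pattern before running
shopt -s globstar
files=(**)                      # every file and directory below the current one
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
  echo "$i/$total" >&2          # progress goes to stderr
  # -d skip ignores the directories in the list; -- protects names starting with "-"
  grep -d skip -e "$pattern" -- "${files[@]:i:100}" >>results.txt
done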

For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
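If the file count needs to exclude directories, or globstar is not available, the find-based alternative mentioned above might look something like this. It is a minimal sketch: collect_files is a hypothetical helper name, and it fills a global array called files so that the loop shown earlier can be reused unchanged.

```shell
#!/usr/bin/env bash
# Sketch: collect regular files only, NUL-delimited so any file name is safe.
# collect_files is a hypothetical helper; it fills the global array "files".

collect_files() {
  local dir=${1:-.} f
  files=()
  # read -d '' consumes one NUL-terminated name at a time from find -print0.
  while IFS= read -r -d '' f; do
    files+=("$f")
  done < <(find "$dir" -type f -print0)
}
```

After collect_files . the array holds only regular files, so total=${#files[@]} is an exact count and -d skip is no longer needed. The loop is slower than a glob for very large trees, as noted above; on bash 4.4 or later, mapfile -d '' files < <(find . -type f -print0) should give the same result without the explicit loop.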

You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
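One simple way to tidy the report is to redraw a single status line instead of printing a new line per batch. The sketch below uses only a carriage return, which moves the cursor back to the start of the line; show_progress and its parameters are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: redraw one status line on stderr instead of printing a line per batch.
# show_progress is a hypothetical helper name.

show_progress() {
  local done_count=$1 total=$2 width=${3:-30}
  local percent=0 filled=0 bar pad
  if (( total > 0 )); then
    percent=$(( done_count * 100 / total ))
    filled=$(( done_count * width / total ))
  fi
  # Build two runs of spaces, then turn them into "#" (done) and "." (pending).
  printf -v bar '%*s' "$filled" ''
  printf -v pad '%*s' "$(( width - filled ))" ''
  # \r returns to the start of the line so the next call overwrites this one.
  printf '\r[%s%s] %3d%% (%d/%d)' "${bar// /#}" "${pad// /.}" \
    "$percent" "$done_count" "$total" >&2
}
```

In the loop above, show_progress "$i" "$total" would take the place of the echo, followed by a final show_progress "$total" "$total"; echo >&2 after the loop to finish the line.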

The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:

find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt

(Here -0 tells parallel that the file names coming from find -print0 are NUL-separated, -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
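For comparison, the simpler fixed-segment approach mentioned earlier could be sketched as below, for cases where GNU parallel is not installed. This is an illustration only: grep_segments is a hypothetical name, the batch size of 100 is as arbitrary as before, and each worker writes to its own temporary file so that output from concurrent greps does not interleave.

```shell
#!/usr/bin/env bash
# Sketch: split a file list into N contiguous segments, one background loop each.
# grep_segments is a hypothetical name; the caller supplies the file list.

grep_segments() {
  local pattern=$1 results=$2 workers=$3
  shift 3
  local -a files=("$@")
  local total=${#files[@]} batch=100 w tmpdir
  local per=$(( (total + workers - 1) / workers ))   # files per worker, rounded up
  tmpdir=$(mktemp -d) || return

  for (( w = 0; w < workers; w++ )); do
    (
      start=$(( w * per ))
      end=$(( start + per ))
      (( end > total )) && end=$total
      for (( i = start; i < end; i += batch )); do
        n=$(( end - i < batch ? end - i : batch ))
        grep -H -e "$pattern" -- "${files[@]:i:n}"
        echo "worker $w: $(( i + n - start ))/$(( end - start ))" >&2
      done > "$tmpdir/$w"
    ) &
  done
  wait   # segments rarely finish together, which is the drawback noted above

  # Concatenate the per-worker results in segment order.
  for (( w = 0; w < workers; w++ )); do
    cat "$tmpdir/$w"
  done > "$results"
  rm -rf "$tmpdir"
}
```

With the files array from the earlier script, a call might look like grep_segments "$pattern" results.txt 4 "${files[@]}".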

answered Sep 21 '22 13:09

rici
