Sometimes I'm grep
-ing thousands of files and it'd be nice to see some kind of progress (bar or status).
I know this is not trivial because grep
outputs the search results to STDOUT and my default workflow is that I output the results to a file and would like the progress bar/status to be output to STDOUT or STDERR .
Would this require modifying source code of grep
?
Ideal command is:
grep -e "STRING" --results="FILE.txt"
and the progress:
[curr file being searched], number x/total number of files
written to STDOUT or STDERR
In recent versions of pv there is an "-d" -Option to watch all the FDs of another process. For the Problem above a simpler idea is the following: While the grep is running, use lsof together with watch . That way you can monitor the progress of your grep.
Using Grep to Filter the Output of a CommandA command's output can be filtered with grep through piping, and only the lines matching a given pattern will be printed on the terminal. You can also chain multiple pipes in on command. As you can see in the output above there is also a line containing the grep process.
Option 1: Use the dd Command to Show Progress While the system is copying the specified file, it shows the amount of data that has been copied and the time elapsed. Once the process is complete, the terminal displays the total amount of data transferred and the time duration of the process.
The grep command searches through the file, looking for matches to the pattern specified. To use it type grep , then the pattern we're searching for and finally the name of the file (or files) we're searching in. The output is the three lines in the file that contain the letters 'not'.
This wouldn't necessarily require modifying grep
, although you could probably get a more accurate progress bar with such a modification.
If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r
option to recursively a directory structure. In that case, it is not even clear that grep
knows how many files it will examine, because I believe it starts examining files before it explores the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this.)
In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep
in batches of some size, maybe 100, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports but they would also increase overhead since they would require additional grep process start-up, and the process start-up time can be more than grepping a small file. The progress report would be updated for each batch of files, so you would want to choose a batch size that gave you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat
to get the filesize) would make the progress report more exact but add an additional cost to process startup.
One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.
In broad terms, a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize).
# Requires bash 4 and Gnu grep
shopt -s globstar
files=(**)
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
echo $i/$total >>/dev/stderr
grep -d skip -e "$pattern" "${files[@]:i:100}" >>results.txt
done
For simplicity, I use a globstar (**
) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find
, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/
only matches directories.) Fortunately, GNU grep provides the -d skip
option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:
find . -type f -print0 |
parallel --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt
(Here -L 100
specifies that up to 100 files should be given to each grep instance, and -j 4
specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With