grep - how to output progress bar or status

Tags: grep, bash

Sometimes I'm grep-ing thousands of files and it'd be nice to see some kind of progress (bar or status).

I know this is not trivial, because grep writes the search results to STDOUT. My default workflow is to redirect the results to a file, and I would like the progress bar/status to be written to STDOUT or STDERR instead.

Would this require modifying the source code of grep?

Ideal command is:

grep -e "STRING" --results="FILE.txt"

and the progress:

[curr file being searched], number x/total number of files

written to STDOUT or STDERR

Bob asked Jun 07 '16 15:06



1 Answer

This wouldn't necessarily require modifying grep, although you could probably get a more accurate progress bar with such a modification.

If you are grepping "thousands of files" with a single invocation of grep, it is most likely that you are using the -r option to recursively search a directory structure. In that case, it is not even clear that grep knows how many files it will examine, because I believe it starts examining files before it has explored the entire directory structure. Exploring the directory structure first would probably increase the total scan time (and, indeed, there is always a cost to producing progress reports, which is why few traditional Unix utilities do this).

In any case, a simple but slightly inaccurate progress bar could be obtained by constructing the complete list of files to be scanned and then feeding them to grep in batches of some size, maybe 100 files, or maybe based on the total size of the batch. Small batches would allow for more accurate progress reports, but they would also increase overhead: each batch requires starting another grep process, and process start-up can take longer than grepping a small file. The progress report would be updated once per batch, so you would want to choose a batch size that gives you regular updates without increasing overhead too much. Basing the batch size on the total size of the files (using, for example, stat to get each file's size) would make the progress report more exact but would add the extra cost of looking up every file's size.
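As a rough illustration of the size-based variant, here is a minimal sketch that groups consecutive files into batches by total byte count. The function name size_batches and its output format are made up for this example, and it assumes GNU stat (the -c %s form); it is not part of grep or any standard tool.

```shell
#!/usr/bin/env bash
# size_batches LIMIT FILE...
# Prints one "START COUNT" line per batch. A batch is a run of consecutive
# files that is closed as soon as its total size reaches LIMIT bytes.
# (Hypothetical helper for illustration only.)
size_batches() {
  local limit=$1
  shift
  local start=0 count=0 sum=0 f size
  for f in "$@"; do
    # Assumption: GNU stat. On BSD/macOS the equivalent is: stat -f %z
    size=$(stat -c %s -- "$f")
    sum=$((sum + size))
    count=$((count + 1))
    if ((sum >= limit)); then
      echo "$start $count"
      start=$((start + count))
      count=0
      sum=0
    fi
  done
  # Emit the final, possibly undersized, batch.
  if ((count > 0)); then
    echo "$start $count"
  fi
  return 0
}
```

Each output line can be used as the offset and length of an array slice, for example "${files[@]:START:COUNT}". Note that this runs stat once per file, which is exactly the extra cost mentioned above.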

One advantage of this strategy is that you could also run two or more greps in parallel, which might speed the process up a bit.


In broad terms, a simple script (which just divides the files by count, not by size, and which doesn't attempt to parallelize) might look like this:

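# Requires bash 4 and GNU grep
pattern="STRING"              # the string to search for
shopt -s globstar
files=(**)                    # every path below the current directory
total=${#files[@]}
for ((i=0; i<total; i+=100)); do
  echo "$i/$total" >&2        # progress report goes to stderr
  # -d skip ignores directories; -- guards against names starting with "-"
  grep -d skip -e "$pattern" -- "${files[@]:i:100}" >>results.txt
done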

For simplicity, I use a globstar (**) to safely put all the files in an array. If your version of bash is too old, then you can do it by looping over the output of find, but that's not very efficient if you have lots of files. Unfortunately, there is no way that I know of to write a globstar expression which only matches files. (**/ only matches directories.) Fortunately, GNU grep provides the -d skip option which silently skips directories. That means that the file count will be slightly inaccurate, since directories will be counted, but it probably doesn't make much difference.
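For the find-based alternative, a minimal sketch might look like the following. The function name collect_files is made up for this example. It reads NUL-delimited names so that unusual filenames survive, and because find is asked for regular files only, the count no longer includes directories.

```shell
#!/usr/bin/env bash
# collect_files [DIR]
# Fills the global array "files" with every regular file below DIR
# (default: the current directory). Hypothetical helper for illustration.
collect_files() {
  files=()
  local f
  # -print0 and read -d '' keep names containing spaces or newlines intact.
  while IFS= read -r -d '' f; do
    files+=("$f")
  done < <(find "${1:-.}" -type f -print0)
}
```

After calling collect_files, ${#files[@]} gives the total and the array can be sliced into batches in the same way as the globstar version.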

You probably will want to make the progress report cleaner by using some console codes. The above is just to get you started.
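One way this could look, as a sketch rather than a finished tool: the function below (grep_batched is a made-up name) rewrites a single status line on stderr using a carriage return and the ANSI "erase to end of line" sequence, and shows the first file of the current batch. It assumes GNU grep for -d skip and a terminal that understands ANSI escape codes.

```shell
#!/usr/bin/env bash
# grep_batched PATTERN OUTFILE BATCH FILE...
# Greps FILE... in batches of BATCH files, appending matches to OUTFILE and
# updating a one-line progress report on stderr.
# (Hypothetical helper for illustration only.)
grep_batched() {
  local pattern=$1 out=$2 batch=$3
  shift 3
  local files=("$@")
  local total=${#files[@]} i
  for ((i = 0; i < total; i += batch)); do
    # \r moves to the start of the line; \033[K clears what was there before.
    printf '\r\033[K%d/%d %s' "$i" "$total" "${files[i]}" >&2
    # -H prints the filename even if a batch holds only one file.
    # Exit status 1 only means "no match in this batch", so it is not an error.
    grep -H -d skip -e "$pattern" -- "${files[@]:i:batch}" >>"$out" ||
      [ "$?" -eq 1 ]
  done
  printf '\r\033[K%d/%d done\n' "$total" "$total" >&2
}
```

A call such as grep_batched "STRING" results.txt 100 "${files[@]}" keeps the results in the file and the progress on the terminal, since only stdout is redirected.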

The simplest way to divide that into different processes would be to just divide the list into X different segments and run X different for loops, each with a different starting point. However, they probably won't all finish at the same time so that is sub-optimal. A better solution is GNU parallel. You might do something like this:

find . -type f -print0 |
parallel -0 --progress -L 100 -m -j 4 grep -e "$pattern" > results.txt

(Here -0 tells parallel that the input is NUL-separated, which is what find's -print0 produces; -L 100 specifies that up to 100 files should be given to each grep instance, and -j 4 specifies four parallel processes. I just pulled those numbers out of the air; you'll probably want to adjust them.)

rici answered Sep 21 '22 13:09