
List files that contain `n` or fewer lines

Question

In a folder, I would like to print the name of every .txt file that contains n=27 or fewer lines. I could do

wc -l *.txt | awk '{if ($1 <= 27){print}}'

The problem is that many files in the folder are millions of lines long (and the lines are pretty long too), so the command wc -l *.txt is very slow. In principle, a process could count lines only until it has seen more than n of them and then move on to the next file.
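
For illustration only, here is a minimal shell sketch of that idea (it spawns head and wc for every file, so it is not the fastest approach, but it shows the principle of reading at most n+1 lines per file):

n=27
for f in *.txt; do
    # head stops after n+1 lines, so huge files are never read to the end
    if [ $(head -n $((n + 1)) "$f" | wc -l) -le "$n" ]; then
        printf '%s\n' "$f"
    fi
done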

What is a faster alternative?

FYI, I am on Mac OS X 10.11.6.

Attempt

Here is an attempt with awk

#!/bin/awk -f

function printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
{
  if (previousNbLines <= n) 
  {
    print previousNbLines": "previousFILENAME
  }
}

BEGIN{
  previousNbLines=n+1
  previousFILENAME="NA"  # quoted: a bare NA is just an empty, uninitialized variable
} 


{
  if (FNR==1)
  {
    printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
    previousFILENAME=FILENAME
  }
  previousNbLines=FNR
  if (FNR > n)
  {
    nextfile
  }
}

END{
  printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
}

which can be called as

awk -v n=27 -f myAwk.awk *.txt

However, the code fails to print completely empty files: an empty file produces no input records, so FILENAME never takes its name and the script never gets a chance to record or report it. I am not sure how to fix that, and I am not sure my awk script is the way to go anyway.

Asked by Remi.b



4 Answers

With GNU awk for nextfile and ENDFILE:

awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
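
The same one-liner spelled out as a standalone script with comments (a hypothetical haslines.awk; still GNU-awk-only because of ENDFILE):

# haslines.awk -- print the name of every input file that has at most n lines
# usage: gawk -v n=27 -f haslines.awk *.txt
FNR > n { f = 1; nextfile }                # more than n lines: flag the file and skip the rest of it
ENDFILE { if (!f) print FILENAME; f = 0 }  # runs at the end of every file, even an empty one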

With any awk:

awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] <= n ) {
                print filename
            }
        }
    }
' *.txt

Both of these will work whether the input files are empty or not. The caveats for the non-gawk version are the same as for the other awk answers here:

  1. It assumes the same file name does not appear multiple times in the argument list (e.g. awk 'script' foo bar foo) when you would want it reported each time, and
  2. It assumes there are no variable assignments in the argument list (e.g. awk 'script' foo FS=, bar); see the example below.
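
To illustrate the second caveat with a hypothetical invocation:

awk -v n=27 'script' foo.txt FS=, bar.txt

Here FS=, is a command-line variable assignment, but it still appears in ARGV, so the END loop of the non-gawk version would treat it as a file name and print it (no lines are ever counted for it, so its count compares as zero).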

The gawk version has no such restrictions.

UPDATE:

To compare the timing of the above GNU awk script against the GNU grep+sed script posted by xhienne, who stated that the grep+sed solution would be faster than a pure awk script, I created 10,000 input files, each 0 to 1000 lines long, using this script:

$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN {
    for (i=1; i<=numFiles; i++) {
        numLines = int(rand()*(maxLines+1))
        out = "out_" i ".txt"
        printf "" > out
        for (j=1; j<=numLines; j++) print ("foo" j) > out
    }
}'

and then ran the two commands on them and got these third-run timing results:

$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed

real    0m1.326s
user    0m0.249s
sys     0m0.654s

$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk

real    0m1.092s
user    0m0.343s
sys     0m0.748s

Both scripts produced the same output files. The above was run in bash on Cygwin. I expect the timing results to vary a little on different systems, but the difference will always be negligible.


To print 10 lines of up to 20 random chars per line (see the comments):

$ maxChars=20
    LC_ALL=C tr -dc '[:print:]' </dev/urandom |
    fold -w "$maxChars" |
    awk -v maxChars="$maxChars" -v numLines=10 '
        { print substr($0,1,rand()*(maxChars+1)) }
        NR==numLines { exit }
    '
0J)-8MzO2V\XA/o'qJH
@r5|g<WOP780
^O@bM\
vP{l^pgKUFH9
-6r&]/-6dl}pp W
&.UnTYLoi['2CEtB
Y~wrM3>4{
^F1mc9
?~NHh}a-EEV=O1!y
of

To do it all within awk (which will be much slower):

$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}

$ awk -f tst.awk
Heer H{QQ?qHDv|
Psuq
Ey`-:O2v7[]|N^EJ0
j#@/y>CJ3:=3*b-joG:
?
^|O.[tYlmDo
TjLw
`2Rs=
!('IC
hui
Answered by Ed Morton


If you are using GNU grep (unfortunately, macOS >= 10.8 provides BSD grep, whose -m and -c options act globally, not per file), you may find this alternative interesting (and faster than a pure awk script):

grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//'

Explanation:

  • grep -c -m28 -H ^ *.txt outputs the name of each file along with its line count, but never reads more than 28 lines of any file
  • sed '/:28$/ d; s/:[^:]*$//' drops the files that have at least 28 lines and prints the file names of the others, with the trailing count removed
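
For illustration, with hypothetical files small.txt (12 lines), empty.txt (0 lines) and big.txt (millions of lines), the pipeline behaves like this:

$ grep -c -m28 -H ^ small.txt empty.txt big.txt
small.txt:12
empty.txt:0
big.txt:28

$ grep -c -m28 -H ^ small.txt empty.txt big.txt | sed '/:28$/ d; s/:[^:]*$//'
small.txt
empty.txt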

Alternate version: sequential processing instead of a parallel one (grep finishes before sed starts, instead of the two running concurrently in a pipeline):

res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"
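
Here $files is assumed to already hold the expanded file list; one possible way to build it (it breaks on file names containing whitespace):

files=$(echo *.txt)
res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"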

Benchmarking

Ed Morton challenged my claim that this answer may be faster than awk. He added some benchmarks to his answer and, although he does not draw any conclusion from them, I consider the results he posted misleading: they show a greater wall-clock time for my answer without any regard for user and sys times. Therefore, here are my results.

First the test platform:

  • A four-core Intel i5 laptop running Linux, probably quite close to OP's system (Apple iMac).

  • A brand new directory of 100,000 text files with ~400 lines on average, for a total of 640 MB, kept entirely in my system's buffer cache. The files were created with this command:

    for ((f = 0; f < 100000; f++)); do echo "File $f..."; for ((l = 0; l < RANDOM & 1023; l++)); do echo "File $f; line $l"; done > file_$f.txt; done
    

Results:

  • grep+sed (this answer) : 561 ms elapsed, 586 ms user+sys
  • grep+sed (this answer, sequential version) : 678 ms elapsed, 688 ms user+sys
  • awk (Ed Morton): 1050 ms elapsed, 1036 ms user+sys
  • awk (tripleee): 1137 ms elapsed, 1123 ms user+sys
  • awk (anubhava): 1150 ms elapsed, 1137 ms user+sys
  • awk (kvantour): 1280 ms elapsed, 1266 ms user+sys
  • python (Joey Harrington): 1543 ms elapsed, 1537 ms user+sys
  • find+xargs+sed (agc): 91 s elapsed, 10 s user+sys
  • for+awk (Jeff Schaller): 247 s elapsed, 83 s user+sys
  • find+bash+grep (hek2mgl): 356 s elapsed, 116 s user+sys

Conclusion:

At the time of writing, on a regular Unix multi-core laptop similar to OP's machine, this answer is the fastest that gives accurate results. On my machine, it is twice as fast as the fastest awk script.

Notes:

  • Why does the platform matter? Because my answer relies on parallelizing the processing between grep and sed. Of course, for unbiased results, if you have only one CPU core (VM?) or other limitations by your OS regarding CPU allocation, you should benchmark the alternate (sequential) version.

  • Obviously, you can't draw conclusions from the wall time alone, since it depends on the number of concurrent processes competing for the CPU vs the number of cores on the machine. That is why I have also given the user+sys timings.

  • Those timings are an average over 20 runs, except when the command took more than 1 minute (one run only)

  • For all the answers that take less than 10 s, the time spent by the shell to process *.txt is not negligible, therefore I preprocessed the file list, put it in a variable, and appended the content of the variable to the command I was benchmarking.

  • All answers gave the same results except: 1. tripleee's answer, which includes argv[0] ("awk") in its results (fixed in my tests); 2. kvantour's answer, which only listed empty files (fixed with -v n=27); and 3. the find+sed answer, which misses empty files (not fixed).

  • I couldn't test ctac_'s answer since I have no GNU sed 4.5 at hand. It is probably the fastest of all but also misses empty files.

  • The python answer doesn't close its files. I had to do ulimit -n hard first.

Answered by xhienne


You may try this awk command, which moves on to the next file as soon as the line count goes above 27:

awk -v n=27 'BEGIN{for (i=1; i<ARGC; i++) f[ARGV[i]]}
FNR > n{delete f[FILENAME]; nextfile}
END{for (i in f) print i}' *.txt

awk processes files line by line, so it won't attempt to read a complete file just to get its line count.

Answered by anubhava


How's this?

awk 'BEGIN { for(i=1;i<ARGC; ++i) arg[ARGV[i]] }
  FNR==28 { delete arg[FILENAME]; nextfile }
  END { for (file in arg) print file }' *.txt

We copy the list of file name arguments into an associative array, then remove from it every file that has a 28th line. Empty files obviously can't match this condition, so at the end we are left with all the files that have fewer than 28 lines, including the empty ones.

nextfile was a common extension in many Awk variants and then was codified by POSIX in 2012. If you need this to work on really old dinosaur OSes (or, good heavens, probably Windows), good luck, and/or try GNU Awk.
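
If you do land on an awk without nextfile, a portable fallback could look like this sketch (it loses most of the speed benefit, since every line is still read, but it gives the same output):

awk 'BEGIN { for (i=1; i<ARGC; ++i) arg[ARGV[i]] }
  FNR==1  { skip=0 }
  skip    { next }
  FNR==28 { delete arg[FILENAME]; skip=1 }
  END     { for (file in arg) print file }' *.txt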

Answered by tripleee