
List files that contain `n` or fewer lines

Question

In a folder, I would like to print the name of every .txt file that contains n=27 or fewer lines. I could do

wc -l *.txt | awk '{if ($1 <= 27){print}}'

The problem is that many files in the folder are millions of lines long (and the lines are pretty long too), so the command wc -l *.txt is very slow. In principle, a process could count lines only until it has seen more than n of them and then move on to the next file.
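
For illustration only, here is a minimal shell sketch of that idea (it spawns head and wc for every file, so it is not the fastest approach, but it shows the principle of reading at most n+1 lines per file):

n=27
for f in *.txt; do
    # head stops after n+1 lines, so huge files are never read to the end
    if [ $(head -n $((n + 1)) "$f" | wc -l) -le "$n" ]; then
        printf '%s\n' "$f"
    fi
done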

What is a faster alternative?

FYI, I am on Mac OS X 10.11.6.

Attempt

Here is an attempt with awk

#!/bin/awk -f

function printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
{
  if (previousNbLines <= n) 
  {
    print previousNbLines": "previousFILENAME
  }
}

BEGIN{
  previousNbLines=n+1
  previousFILENAME="NA"  # quoted: a bare NA is just an empty, uninitialized variable
} 


{
  if (FNR==1)
  {
    printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
    previousFILENAME=FILENAME
  }
  previousNbLines=FNR
  if (FNR > n)
  {
    nextfile
  }
}

END{
  printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
}

which can be called as

awk -v n=27 -f myAwk.awk *.txt

However, the code fails to print completely empty files: an empty file produces no input records, so FILENAME never takes its name and the script never gets a chance to record or report it. I am not sure how to fix that, and I am not sure my awk script is the way to go anyway.

Asked by Remi.b



4 Answers

With GNU awk for nextfile and ENDFILE:

awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
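
The same one-liner spelled out as a standalone script with comments (a hypothetical haslines.awk; still GNU-awk-only because of ENDFILE):

# haslines.awk -- print the name of every input file that has at most n lines
# usage: gawk -v n=27 -f haslines.awk *.txt
FNR > n { f = 1; nextfile }                # more than n lines: flag the file and skip the rest of it
ENDFILE { if (!f) print FILENAME; f = 0 }  # runs at the end of every file, even an empty one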

With any awk:

awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] <= n ) {
                print filename
            }
        }
    }
' *.txt

Both of these will work whether the input files are empty or not. The caveats for the non-gawk version are the same as for the other awk answers here:

  1. It assumes the same file name does not appear multiple times in the argument list (e.g. awk 'script' foo bar foo) when you would want it reported each time, and
  2. It assumes there are no variable assignments in the argument list (e.g. awk 'script' foo FS=, bar); see the example below.
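
To illustrate the second caveat with a hypothetical invocation:

awk -v n=27 'script' foo.txt FS=, bar.txt

Here FS=, is a command-line variable assignment, but it still appears in ARGV, so the END loop of the non-gawk version would treat it as a file name and print it (no lines are ever counted for it, so its count compares as zero).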

The gawk version has no such restrictions.

UPDATE:

To compare the timing of the above GNU awk script against the GNU grep+sed script posted by xhienne, who stated that the grep+sed solution would be faster than a pure awk script, I created 10,000 input files, each 0 to 1000 lines long, using this script:

$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN {
    for (i=1; i<=numFiles; i++) {
        numLines = int(rand()*(maxLines+1))
        out = "out_" i ".txt"
        printf "" > out
        for (j=1; j<=numLines; j++) print ("foo" j) > out
    }
}'

and then ran the two commands on them and got these third-run timing results:

$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed

real    0m1.326s
user    0m0.249s
sys     0m0.654s

$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk

real    0m1.092s
user    0m0.343s
sys     0m0.748s

Both scripts produced the same output files. The above was run in bash on Cygwin. I expect the timing results to vary a little on different systems, but the difference will always be negligible.


To print 10 lines of up to 20 random chars per line (see the comments):

$ maxChars=20
    LC_ALL=C tr -dc '[:print:]' </dev/urandom |
    fold -w "$maxChars" |
    awk -v maxChars="$maxChars" -v numLines=10 '
        { print substr($0,1,rand()*(maxChars+1)) }
        NR==numLines { exit }
    '
0J)-8MzO2V\XA/o'qJH
@r5|g<WOP780
^O@bM\
vP{l^pgKUFH9
-6r&]/-6dl}pp W
&.UnTYLoi['2CEtB
Y~wrM3>4{
^F1mc9
?~NHh}a-EEV=O1!y
of

To do it all within awk (which will be much slower):

$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}

$ awk -f tst.awk
Heer H{QQ?qHDv|
Psuq
Ey`-:O2v7[]|N^EJ0
j#@/y>CJ3:=3*b-joG:
?
^|O.[tYlmDo
TjLw
`2Rs=
!('IC
hui
Answered by Ed Morton


If you are using GNU grep (unfortunately, macOS >= 10.8 provides BSD grep, whose -m and -c options act globally, not per file), you may find this alternative interesting (and faster than a pure awk script):

grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//'

Explanation:

  • grep -c -m28 -H ^ *.txt outputs the name of each file along with its line count, but never reads more than 28 lines of any file
  • sed '/:28$/ d; s/:[^:]*$//' drops the files that have at least 28 lines and prints the file names of the others, with the trailing count removed
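
For illustration, with hypothetical files small.txt (12 lines), empty.txt (0 lines) and big.txt (millions of lines), the pipeline behaves like this:

$ grep -c -m28 -H ^ small.txt empty.txt big.txt
small.txt:12
empty.txt:0
big.txt:28

$ grep -c -m28 -H ^ small.txt empty.txt big.txt | sed '/:28$/ d; s/:[^:]*$//'
small.txt
empty.txt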

Alternate version: sequential processing instead of a parallel one (grep finishes before sed starts, instead of the two running concurrently in a pipeline):

res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"
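
Here $files is assumed to already hold the expanded file list; one possible way to build it (it breaks on file names containing whitespace):

files=$(echo *.txt)
res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"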

Benchmarking

Ed Morton challenged my claim that this answer may be faster than awk. He added some benchmarks to his answer and, although he does not draw any conclusion from them, I consider the results he posted misleading: they show a greater wall-clock time for my answer without any regard for user and sys times. Therefore, here are my results.

First the test platform:

  • A four-core Intel i5 laptop running Linux, probably quite close to OP's system (Apple iMac).

  • A brand new directory of 100,000 text files with ~400 lines on average, for a total of 640 MB, kept entirely in my system's buffer cache. The files were created with this command:

    for ((f = 0; f < 100000; f++)); do echo "File $f..."; for ((l = 0; l < RANDOM & 1023; l++)); do echo "File $f; line $l"; done > file_$f.txt; done
    

Results:

  • grep+sed (this answer) : 561 ms elapsed, 586 ms user+sys
  • grep+sed (this answer, sequential version) : 678 ms elapsed, 688 ms user+sys
  • awk (Ed Morton): 1050 ms elapsed, 1036 ms user+sys
  • awk (tripleee): 1137 ms elapsed, 1123 ms user+sys
  • awk (anubhava): 1150 ms elapsed, 1137 ms user+sys
  • awk (kvantour): 1280 ms elapsed, 1266 ms user+sys
  • python (Joey Harrington): 1543 ms elapsed, 1537 ms user+sys
  • find+xargs+sed (agc): 91 s elapsed, 10 s user+sys
  • for+awk (Jeff Schaller): 247 s elapsed, 83 s user+sys
  • find+bash+grep (hek2mgl): 356 s elapsed, 116 s user+sys

Conclusion:

At the time of writing, on a regular Unix multi-core laptop similar to OP's machine, this answer is the fastest that gives accurate results. On my machine, it is twice as fast as the fastest awk script.

Notes:

  • Why does the platform matter? Because my answer relies on parallelizing the processing between grep and sed. Of course, for unbiased results, if you have only one CPU core (VM?) or other limitations by your OS regarding CPU allocation, you should benchmark the alternate (sequential) version.

  • Obviously, you can't draw conclusions from the wall time alone, since it depends on the number of concurrent processes competing for the CPU vs the number of cores on the machine. That is why I have also given the user+sys timings.

  • Those timings are an average over 20 runs, except when the command took more than 1 minute (one run only)

  • For all the answers that take less than 10 s, the time spent by the shell to process *.txt is not negligible, therefore I preprocessed the file list, put it in a variable, and appended the content of the variable to the command I was benchmarking.

  • All answers gave the same results except: 1. tripleee's answer, which includes argv[0] ("awk") in its results (fixed in my tests); 2. kvantour's answer, which only listed empty files (fixed with -v n=27); and 3. the find+sed answer, which misses empty files (not fixed).

  • I couldn't test ctac_'s answer since I have no GNU sed 4.5 at hand. It is probably the fastest of all but also misses empty files.

  • The python answer doesn't close its files. I had to do ulimit -n hard first.

Answered by xhienne


You may try this awk command, which moves on to the next file as soon as the line count goes above 27:

awk -v n=27 'BEGIN{for (i=1; i<ARGC; i++) f[ARGV[i]]}
FNR > n{delete f[FILENAME]; nextfile}
END{for (i in f) print i}' *.txt

awk processes files line by line, so it won't attempt to read a complete file just to get its line count.

Answered by anubhava


How's this?

awk 'BEGIN { for(i=1;i<ARGC; ++i) arg[ARGV[i]] }
  FNR==28 { delete arg[FILENAME]; nextfile }
  END { for (file in arg) print file }' *.txt

We copy the list of file name arguments into an associative array, then remove from it every file that has a 28th line. Empty files obviously can't match this condition, so at the end we are left with all the files that have fewer than 28 lines, including the empty ones.

nextfile was a common extension in many Awk variants and then was codified by POSIX in 2012. If you need this to work on really old dinosaur OSes (or, good heavens, probably Windows), good luck, and/or try GNU Awk.
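
If you do land on an awk without nextfile, a portable fallback could look like this sketch (it loses most of the speed benefit, since every line is still read, but it gives the same output):

awk 'BEGIN { for (i=1; i<ARGC; ++i) arg[ARGV[i]] }
  FNR==1  { skip=0 }
  skip    { next }
  FNR==28 { delete arg[FILENAME]; skip=1 }
  END     { for (file in arg) print file }' *.txt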

Answered by tripleee