In a folder, I would like to print the name of every .txt file that contains n=27 lines or fewer. I could do
wc -l *.txt | awk '{if ($1 <= 27){print}}'
The problem is that many files in the folder are millions of lines long (and the lines are pretty long), so the command wc -l *.txt is very slow. In principle, a process could stop counting as soon as it has found at least n lines and then proceed to the next file. What is a faster alternative?
FYI, I am on Mac OS X 10.11.6.
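Just to illustrate the early-stop idea, a minimal sketch in plain shell would be something like the loop below (it spawns two processes per file, which is itself slow, so I assume there is something better):

n=27
for f in *.txt; do
    # head reads at most n+1 lines, so long files are never read in full
    [ "$(head -n "$((n+1))" "$f" | wc -l)" -le "$n" ] && printf '%s\n' "$f"
done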
Here is an attempt with awk
#!/bin/awk -f
function printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
{
    if (previousNbLines <= n)
    {
        print previousNbLines": "previousFILENAME
    }
}
BEGIN{
    previousNbLines=n+1
    previousFILENAME=NA
}
{
    if (FNR==1)
    {
        printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
        previousFILENAME=FILENAME
    }
    previousNbLines=FNR
    if (FNR > n)
    {
        nextfile
    }
}
END{
    printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
}
which can be called as
awk -v n=27 -f myAwk.awk *.txt
However, the code fails to print completely empty files. I am not sure how to fix that, and I am not sure my awk script is the way to go.
With GNU awk for nextfile and ENDFILE:
awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
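A quick sanity check, assuming GNU awk and an otherwise empty test directory (the file names are just placeholders):

printf 'a\nb\n' > small.txt   # 2 lines
seq 30 > big.txt              # 30 lines
: > empty.txt                 # 0 lines
awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt
# prints empty.txt and small.txt; big.txt is skipped after its 28th line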
With any awk:
awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] <= n ) {
                print filename
            }
        }
    }
' *.txt
Those will both work whether the input files are empty or not (an empty file never sets fnrs[], so its entry is treated as 0 in the END comparison). The caveats for the non-gawk version are the same as for the other awk answers here: it relies on the same file name not appearing multiple times (e.g. awk 'script' foo bar foo) with you wanting it displayed multiple times, and on there being no variable assignments in the arg list (e.g. awk 'script' foo FS=, bar). The gawk version has no such restrictions.
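A quick way to see why that second caveat matters (using /dev/null as a placeholder file name) is to print ARGV yourself:

# awk keeps every command-line argument in ARGV, including variable
# assignments such as FS=, so the END loop above would treat "FS=,"
# as a file name, and a duplicate name appears twice.
awk 'END { for (i=1; i<ARGC; i++) print i, ARGV[i] }' /dev/null FS=, /dev/null
# 1 /dev/null
# 2 FS=,
# 3 /dev/null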
UPDATE:
To test the timing between the above GNU awk script and the GNU grep+sed script posted by xhienne, since she stated that her solution would be faster than a pure awk script, I created 10,000 input files, each 0 to 1000 lines long, using this script:
$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN{for (i=1;i<=numFiles;i++) {numLines=int(rand()*(maxLines+1)); out="out_"i".txt"; printf "" > out; for (j=1;j<=numLines; j++) print ("foo" j) > out} }'
and then ran the two commands on them and got these third-run timing results:
$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed
real 0m1.326s
user 0m0.249s
sys 0m0.654s
$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk
real 0m1.092s
user 0m0.343s
sys 0m0.748s
Both scripts produced the same output files. The above was run in bash on Cygwin. I expect the timing results might vary a little on different systems, but the difference will always be negligible.
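If you want to verify that claim yourself, both commands emit file names in the same glob order, so a plain diff of the two output files written above is enough:

diff out.grepsed out.awk && echo "same files selected"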
To print 10 lines of up to 20 random chars per line (see the comments):
$ maxChars=20
LC_ALL=C tr -dc '[:print:]' </dev/urandom |
fold -w "$maxChars" |
awk -v maxChars="$maxChars" -v numLines=10 '
{ print substr($0,1,rand()*(maxChars+1)) }
NR==numLines { exit }
'
0J)-8MzO2V\XA/o'qJH
@r5|g<WOP780
^O@bM\
vP{l^pgKUFH9
-6r&]/-6dl}pp W
&.UnTYLoi['2CEtB
Y~wrM3>4{
^F1mc9
?~NHh}a-EEV=O1!y
of
To do it all within awk (which will be much slower):
$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}
$ awk -f tst.awk
Heer H{QQ?qHDv|
Psuq
Ey`-:O2v7[]|N^EJ0
j#@/y>CJ3:=3*b-joG:
?
^|O.[tYlmDo
TjLw
`2Rs=
!('IC
hui
If you are using GNU grep (unfortunately Mac OS X >= 10.8 provides BSD grep, whose -m and -c options act globally, not per file), you may find this alternative interesting (and faster than a pure awk script):
grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//'
Explanation:
grep -c -m28 -H ^ *.txt outputs the name of each file along with its line count, but never reads more than 28 lines per file.
sed '/:28$/ d; s/:[^:]*$//' removes the files that have at least 28 lines and prints the file names of the others.
Alternate version: sequential processing instead of a parallel one:
res=$(grep -c -m28 -H ^ $files); sed '/:28$/ d; s/:[^:]*$//' <<< "$res"
Ed Morton challenged my claim that this answer may be faster than awk. He added some benchmarks to his answer and, although he does not give any conclusion, I consider the results he posted misleading: they show a greater wall-clock time for my answer without any regard to user and sys times. Therefore, here are my results.
First the test platform:
A four-core Intel i5 laptop running Linux, probably quite close to OP's system (Apple iMac).
A brand-new directory of 100,000 text files with ~400 lines on average, for a total of 640 MB, which is kept entirely in my system buffers. The files were created with this command:
for ((f = 0; f < 100000; f++)); do echo "File $f..."; for ((l = 0; l < RANDOM & 1023; l++)); do echo "File $f; line $l"; done > file_$f.txt; done
Results:
Conclusion:
At the time of writing, on a regular Unix multi-core laptop similar to OP's machine, this answer is the fastest that gives accurate results. On my machine, it is twice as fast as the fastest awk script.
Notes:
Why does the platform matter? Because my answer relies on parallelizing the processing between grep and sed. Of course, for unbiased results, if you have only one CPU core (VM?) or other OS-imposed limitations on CPU allocation, you should benchmark the alternate (sequential) version.
Obviously, you can't draw conclusions from the wall time alone, since that depends on the number of concurrent processes asking for the CPU vs. the number of cores on the machine. Therefore I have added the user+sys timings.
These timings are an average over 20 runs, except when the command took more than 1 minute (one run only).
For all the answers that take less than 10 s, the time spent by the shell expanding *.txt is not negligible, so I preprocessed the file list, put it in a variable, and appended the content of the variable to the command I was benchmarking.
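A sketch of that setup (the files variable is the one reused in the alternate version above; this assumes file names without spaces):

files=$(echo *.txt)   # expand the glob once, outside the timed command
time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' $files > /dev/null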
All answers gave the same results except: 1. tripleee's answer, which includes argv[0] ("awk") in its result (fixed in my tests); 2. kvantour's answer, which only listed empty files (fixed with -v n=27); and 3. the find+sed answer, which misses empty files (not fixed).
I couldn't test ctac_'s answer since I have no GNU sed 4.5 at hand. It is probably the fastest of all but also misses empty files.
The Python answer doesn't close its files; I had to do ulimit -n hard first.
You may try this awk, which moves on to the next file as soon as the line count goes above 27:
awk -v n=27 'BEGIN{for (i=1; i<ARGC; i++) f[ARGV[i]]}
FNR > n{delete f[FILENAME]; nextfile}
END{for (i in f) print i}' *.txt
awk processes files line by line, so it won't attempt to read the complete file to get the line count.
How's this?
awk 'BEGIN { for(i=1;i<ARGC; ++i) arg[ARGV[i]] }
FNR==28 { delete arg[FILENAME]; nextfile }
END { for (file in arg) print file }' *.txt
We copy the list of file name arguments to an associative array, then remove all files which have a 28th line from it. Empty files obviously won't match this condition, so at the end, we are left with all files which have fewer lines, including the empty ones.
nextfile was a common extension in many Awk variants and was then codified by POSIX in 2012. If you need this to work on really old dinosaur OSes (or, good heavens, probably Windows), good luck, and/or try GNU Awk.
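If nextfile is not available at all, a portable sketch of the same idea (slower, since every remaining line of a long file is still read and discarded) can use a per-file skip flag instead:

awk -v n=27 'BEGIN { for (i=1; i<ARGC; ++i) arg[ARGV[i]] }
    FNR==1   { skip=0 }                       # reset at the start of each file
    skip     { next }                         # ignore the rest of a long file
    FNR==n+1 { delete arg[FILENAME]; skip=1 } # 28th line seen: drop the file
    END      { for (file in arg) print file }' *.txt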