How to efficiently list files that have exactly `n` lines?

Question

In order to list files that have exactly n lines, one can do

n=5
find . -name "*.txt" | xargs wc -l | awk -v n=${n} -F" " '{if ($1==n) {print $2} }'

but this solution is quite slow: it counts every line of every file first, and only then selects the files that have n lines. A process that counts lines and stops as soon as it reaches n+1 would be much more efficient (especially for big files with many lines).

How to efficiently list files that have exactly n lines?

Note: in the special case where every line has exactly the same size, one could probably do

n=5
sizePerLine=500
# the c suffix makes -size count bytes; without it find counts 512-byte blocks
find . -name '*.txt' -size "$(( n * sizePerLine ))c"

Eric Renouf · Accepted Answer

I think the following would be faster:

find . -name "*.txt" -exec awk -v n="$n" 'FILENAME != prevfile {if(prevfnr==n) print prevfile} {prevfile = FILENAME; prevfnr = FNR; if(FNR>n) {nextfile;}} END{if (FNR==n) {print FILENAME} }' {} +

How it works:

-exec ... {} +: have find run the command itself, passing as many file names as possible to each invocation rather than starting one process per file.
awk -v n="$n": invoke awk and define an awk variable n with the same value as the shell variable n.
FILENAME != prevfile {if(prevfnr==n) print prevfile}: check whether the current record belongs to a different file than the previous record did; if so, and the previous file had exactly n records, print that file's name.
{prevfile = FILENAME; prevfnr = FNR; if(FNR>n) {nextfile;}}: update prevfile with the current FILENAME and prevfnr with the current FNR. If the current file has already passed n records, jump to the next file without reading the rest of it.
END{if (FNR==n) {print FILENAME} }: at the end, check whether the last file also had exactly n records.
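
As a rough illustration of the steps above, here is the same awk program spread over several lines with comments, run against a throwaway directory. The directory and file names are made up for the demo, and the sketch assumes an awk that supports nextfile.

```shell
# Hypothetical demo directory; the file names are invented for illustration.
demo=$(mktemp -d)
printf 'a\nb\n' > "$demo/two.txt"
printf 'a\nb\nc\n' > "$demo/three.txt"
printf 'a\nb\nc\nd\ne\n' > "$demo/five.txt"

n=3
matches=$(find "$demo" -name '*.txt' -exec awk -v n="$n" '
    # First record of a new file: report the previous file if it had exactly n records.
    FILENAME != prevfile { if (prevfnr == n) print prevfile }
    {
        prevfile = FILENAME      # remember which file is being read
        prevfnr = FNR            # and how many records it has produced so far
        if (FNR > n) nextfile    # already too long: skip the rest of this file
    }
    # The last file has no successor to trigger the check above, so repeat it here.
    END { if (FNR == n) print FILENAME }
' {} +)
echo "$matches"
rm -rf "$demo"
```

Only the path of three.txt should be printed; five.txt is abandoned as soon as its fourth record is seen.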

Interestingly, I found that this gives different results from the version that uses wc -l, though I think this one is probably more correct. wc -l counts newline characters, so for a file whose last line has no line-ending character it does not count that final "unterminated" line, whereas the awk solution here does count it.
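
The difference is easy to reproduce. A minimal sketch, using a temporary file whose last line has no trailing newline:

```shell
f=$(mktemp)
printf 'one\ntwo' > "$f"    # two lines of text, but no newline after the second

# wc -l counts newline characters (tr strips the padding some wc versions add)
wc_count=$(wc -l < "$f" | tr -d ' ')
# awk counts records, and the unterminated last line is still a record
awk_count=$(awk 'END { print NR }' "$f")

echo "wc -l: $wc_count, awk: $awk_count"
rm -f "$f"
```

This prints "wc -l: 1, awk: 2", so with n=2 the awk approach would list this file while the wc -l pipeline would not.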

Argh, I had failed to appreciate that nextfile is a GNU-ism. If I'm already limiting myself to GNU awk, its ENDFILE rule makes this much cleaner:

find . -name '*.txt' -exec awk -v n="$n" 'FNR > n {nextfile;} ENDFILE{if (FNR==n) {print FILENAME} }' {} +

It doesn't seem to me that POSIX awk has a good shortcut for jumping to the next file, which is the key to this solution's efficiency.
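
If portability matters more than avoiding extra processes, one possible workaround (a different technique from the one above, offered only as a sketch) is to run one awk per file and use exit, which is standard awk, in place of nextfile. find's -exec ... \; acts as a test on the command's exit status, so -print fires only for matching files. The demo directory and file names are again made up.

```shell
# Hypothetical demo directory; the file names are invented for illustration.
demo=$(mktemp -d)
printf '1\n2\n3\n' > "$demo/three.txt"
printf '1\n2\n3\n4\n' > "$demo/four.txt"

n=3
# NR > n { exit }           stops reading as soon as line n+1 appears
# END { exit !(NR == n) }   exits with status 0 only if exactly n records were read
portable_matches=$(find "$demo" -name '*.txt' \
    -exec awk -v n="$n" 'NR > n { exit } END { exit !(NR == n) }' {} \; -print)
echo "$portable_matches"
rm -rf "$demo"
```

The trade-off is one awk process per file, which may well outweigh the savings when there are many small files; it is mainly attractive when the files are few and large.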

Tags: performance, file, find, bash, awk

Remi.b
