Say I have the following structure of files and directories:
$ tree
.
├── a
├── b
└── dir
    └── c
1 directory, 3 files
That is, two files, a and b, together with a directory dir that contains another file, c.
I want to process all the files with awk (GNU Awk 4.1.1, to be exact), so I do something like this:
$ gawk '{print FILENAME; nextfile}' * */*
a
b
awk: cmd. line:1: warning: command line argument `dir' is a directory: skipped
dir/c
All is fine, but the * also expands to the directory dir, and awk tries to process it.
So I wonder: is there any native way for awk to check whether a given argument is a regular file and, if it is not, skip it? That is, without using system() for it.
I made it work by calling an external command through system() in BEGINFILE:
$ gawk 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, FNR}' * */*
a
a 10
a.wk
a.wk 3
b
b 10
dir
dir is a dir, skipping
dir/c
dir/c 10
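Note also that the condition if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile} works counterintuitively: system() does not return 1 when the test is true; it returns the command's exit status, which is 0 on success. So the branch is taken when [ ! -d FILENAME ] fails, that is, when FILENAME is a directory.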
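To make the exit-status convention concrete, here is a minimal sketch. The function name dir_test_status and the environment variable DT_PATH are made up for illustration; passing the path through the environment is just one way to avoid quoting problems inside the command string.

```shell
# dir_test_status PATH  (hypothetical helper, not part of awk)
# Prints the value that awk's system() returns for the test `[ -d PATH ]`.
dir_test_status() {
    # The path travels through the environment (DT_PATH is an assumed name),
    # so the shell started by system() expands it safely inside double quotes.
    DT_PATH=$1 awk 'BEGIN {
        rc = system("[ -d \"$DT_PATH\" ]")   # exit status of the test command
        print rc                             # 0 means "is a directory"
    }'
}
```

Calling dir_test_status on a directory prints 0, which awk treats as false in a condition; calling it on a regular file prints 1.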
I read in A.5 Extensions in gawk Not in POSIX awk:
- Directories on the command line produce a warning and are skipped (see Command-line directories)
And then the linked page says:
4.11 Directories on the Command Line
According to the POSIX standard, files named on the awk command line must be text files; it is a fatal error if they are not. Most versions of awk treat a directory on the command line as a fatal error.
By default, gawk produces a warning for a directory on the command line, but otherwise ignores it. This makes it easier to use shell wildcards with your awk program:
$ gawk -f whizprog.awk * Directories could kill this program
If either of the --posix or --traditional options is given, then gawk reverts to treating a directory on the command line as a fatal error.
See Extension Sample Readdir, for a way to treat directories as usable data from an awk program.
And indeed that is the case: the same command as before with --posix fails:
$ gawk --posix 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, NR}' * */*
gawk: cmd. line:1: fatal: cannot open file `dir' for reading (Is a directory)
I checked the 16.7.6 Reading Directories section linked above, which describes the readdir extension:
The readdir extension adds an input parser for directories. The usage is as follows:
@load "readdir"
But I am not sure how to call it or how to use it from the command line.
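For reference, a minimal sketch of one way the extension can be loaded from the command line, assuming a gawk build (4.1 or later) that ships the readdir extension. The -l option is the command-line counterpart of @load; the function name list_dir_entries is made up for illustration. Note that with readdir loaded, a directory argument is read rather than skipped: each directory entry becomes a record of the form inode/name, with a third type field on systems that provide it.

```shell
# list_dir_entries DIR  (hypothetical helper)
# Assumes gawk with the readdir extension installed.
list_dir_entries() {
    # Each record looks like inode/name[/type], so split fields on "/".
    gawk -l readdir 'BEGIN { FS = "/" } { print $2, $3 }' "$1"
}
```

Running list_dir_entries dir would print one line per entry in dir, including the . and .. entries.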
I would simply avoid passing directories to awk, since POSIX itself says that all filename arguments must be text files.
You can use find to traverse the directory:
find PATH -type f -exec awk 'program' {} +
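As a concrete sketch of that pattern, the helper below prints the name of every regular file under a directory, much like the print FILENAME; nextfile program in the question. The function name print_regular_files is made up for illustration, and FNR == 1 is used instead of nextfile only as a conservative choice for older awks; as a side effect, empty files are not listed.

```shell
# print_regular_files DIR  (hypothetical helper)
# find hands awk only regular files (-type f), so no directory ever
# reaches awk's argument list.
print_regular_files() {
    find "$1" -type f -exec awk 'FNR == 1 { print FILENAME }' {} +
}
```

For the tree in the question, print_regular_files . would list ./a, ./b and ./dir/c, in whatever order find visits them.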
If you wanted to safeguard your script from other people mistakenly passing a directory (or anything else that's not a readable text file) to it, you could do this:
$ ls -F tmp
bar dir/ foo
$ cat tmp/foo
line 1
$ cat tmp/bar
line 1
line 2
$ cat tmp/dir
cat: tmp/dir: Is a directory
$ cat tst.awk
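BEGIN {
    # Probe each argument; drop any that cannot be read as a file.
    for (i=1; i<ARGC; i++) {
        if ( (getline line < ARGV[i]) <= 0 ) {
            print "Skipping:", ARGV[i], ERRNO
            delete ARGV[i]
        }
        else {
            close(ARGV[i])   # so the main loop rereads it from the start
        }
    }
}
{ print FILENAME, $0 }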
$ awk -f tst.awk tmp/*
Skipping: tmp/dir Is a directory
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1
$ awk --posix -f tst.awk tmp/*
Skipping: tmp/dir
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1
Per POSIX, getline returns -1 when it fails to retrieve a record from a file (e.g. the file is unreadable, does not exist, or is a directory). You only need GNU awk if you care which of those failures it was, which it reports in ERRNO. The <= 0 test also skips empty files, for which getline returns 0.
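A small sketch for checking those return values by hand. The function name getline_status and the environment variable GL_PATH are made up for illustration.

```shell
# getline_status PATH  (hypothetical helper)
# Prints what `getline line < PATH` returns: 1 if a record was read,
# 0 at end of file (e.g. an empty file), -1 on failure.
getline_status() {
    # GL_PATH is an assumed variable name used to hand the path to awk.
    GL_PATH=$1 awk 'BEGIN {
        path = ENVIRON["GL_PATH"]
        status = (getline line < path)
        print status
        if (status >= 0) close(path)   # only close what was actually opened
    }'
}
```

With this, a non-empty file gives 1, an empty file gives 0, and a nonexistent path gives -1.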