
How to skip a directory in awk?

Tags: awk, dir, gawk

Say I have the following structure of files and directories:

$ tree
.
├── a
├── b
└── dir
    └── c

1 directory, 3 files

That is, two files a and b, plus a directory dir that contains another file c.

I want to process all the files with awk (GNU Awk 4.1.1, to be exact), so I do something like this:

$ gawk '{print FILENAME; nextfile}' * */*
a
b
awk: cmd. line:1: warning: command line argument `dir' is a directory: skipped
dir/c

This works, but * also expands to the directory dir, and awk tries to process it.

So I wonder: is there any native way for awk to check whether a given argument is a directory and, if so, skip it? That is, without using system() for it.

I made it work by calling system() in BEGINFILE:

$ gawk 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, FNR}' * */*
a
a 10
a.wk
a.wk 3
b
b 10
dir
dir is a dir, skipping
dir/c
dir/c 10

Note also that if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile} reads counterintuitively: system() returns the command's exit status rather than a boolean, so a successful test yields 0 (false in awk) and a failed test yields a non-zero value (true in awk).
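
The exit-status convention is easier to see in isolation. The following is a minimal sketch, assuming a POSIX shell and any awk; the scratch directory and the variable names rc_dir and rc_missing are illustrative only:

```shell
# system() hands back the command's exit status, not a boolean:
# 0 means the test succeeded, non-zero means it failed.
tmp=$(mktemp -d)

rc_dir=$(awk -v f="$tmp" 'BEGIN { print system("[ -d \"" f "\" ]") }')
rc_missing=$(awk -v f="$tmp/missing" 'BEGIN { print system("[ -d \"" f "\" ]") }')

echo "directory:    $rc_dir"       # the -d test succeeded
echo "missing path: $rc_missing"   # the -d test failed

rm -rf "$tmp"
```

An if (system(...)) branch is therefore taken when the command fails, which is why the command above tests ! -d in order to detect a directory.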

I read in A.5 Extensions in gawk Not in POSIX awk:

  • Directories on the command line produce a warning and are skipped (see Command-line directories)

And then the linked page says:

4.11 Directories on the Command Line

According to the POSIX standard, files named on the awk command line must be text files; it is a fatal error if they are not. Most versions of awk treat a directory on the command line as a fatal error.

By default, gawk produces a warning for a directory on the command line, but otherwise ignores it. This makes it easier to use shell wildcards with your awk program:

$ gawk -f whizprog.awk *        Directories could kill this program

If either of the --posix or --traditional options is given, then gawk reverts to treating a directory on the command line as a fatal error.

See Extension Sample Readdir, for a way to treat directories as usable data from an awk program.

And indeed that is the case: the same command as before fails with --posix:

$ gawk --posix 'BEGINFILE{print FILENAME; if (system(" [ ! -d " FILENAME " ]")) {print FILENAME, "is a dir, skipping"; nextfile}} ENDFILE{print FILENAME, NR}' * */*
gawk: cmd. line:1: fatal: cannot open file `dir' for reading (Is a directory)

I checked the linked section, 16.7.6 Reading Directories, which describes the readdir extension:

The readdir extension adds an input parser for directories. The usage is as follows:

@load "readdir"

But I am not sure how to call it or how to use it from the command line.

fedorqui 'SO stop harming' asked Dec 01 '15 10:12


2 Answers

I would simply avoid passing directories to awk, since POSIX says that all filename arguments must be text files.

You can use find for traversing the directory:

find PATH -type f -exec awk 'program' {} +
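
As a concrete sketch of that approach, the following rebuilds the layout from the question in a scratch directory and prints each file name once. It assumes a POSIX shell, find, and awk; the scratch directory and the FNR == 1 program are illustrative stand-ins for PATH and 'program':

```shell
# Recreate the question's layout in a scratch directory.
tmp=$(mktemp -d)
mkdir "$tmp/dir"
printf 'x\n' > "$tmp/a"
printf 'x\n' > "$tmp/b"
printf 'x\n' > "$tmp/dir/c"

# -type f passes only regular files to awk, so "dir" itself never
# reaches the awk command line. FNR == 1 fires once per input file.
names=$(cd "$tmp" && find . -type f -exec awk 'FNR == 1 { print FILENAME }' {} + | sort)
printf '%s\n' "$names"

rm -rf "$tmp"
```

Unlike the * */* globs in the question, find descends to any depth. GNU and BSD find accept -maxdepth 2 to stop one level down, but that option is not part of POSIX.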

hek2mgl answered Oct 02 '22 02:10


If you wanted to safeguard your script from other people mistakenly passing a directory (or anything else that's not a readable text file) to it, you could do this:

$ ls -F tmp
bar  dir/  foo

$ cat tmp/foo
line 1

$ cat tmp/bar
line 1
line 2

$ cat tmp/dir
cat: tmp/dir: Is a directory

$ cat tst.awk
BEGIN {
    for (i=1;i<ARGC;i++) {
        # Save the name so close() still sees it after ARGV[i] is deleted
        file = ARGV[i]
        if ( (getline line < file) <= 0 ) {
            print "Skipping:", file, ERRNO
            delete ARGV[i]
        }
        close(file)
    }
}
{ print FILENAME, $0 }

$ awk -f tst.awk tmp/*
Skipping: tmp/dir Is a directory
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1

$ awk --posix -f tst.awk tmp/*
Skipping: tmp/dir
tmp/bar line 1
tmp/bar line 2
tmp/foo line 1

Per POSIX, getline returns -1 when it fails to retrieve a record from a file (e.g. the file is unreadable, does not exist, or is a directory). You only need GNU awk if you care which of those failures occurred, which it reports in ERRNO.
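
The return values can be checked directly. This is a minimal sketch, assuming a POSIX shell and any awk; the file names one, empty, and missing are hypothetical, and a directory is left out because not every awk reports it as -1:

```shell
tmp=$(mktemp -d)
: > "$tmp/empty"                 # exists but holds no records
printf 'line 1\n' > "$tmp/one"   # holds one record

# getline < file yields 1 for a record, 0 at end of file, -1 on failure.
results=$(awk 'BEGIN {
    for (i = 1; i < ARGC; i++) {
        file = ARGV[i]
        print (getline line < file)
        close(file)
    }
}' "$tmp/one" "$tmp/empty" "$tmp/missing")
printf '%s\n' "$results"

rm -rf "$tmp"
```

Because an empty file yields 0, the <= 0 test in tst.awk skips empty files as well, which is harmless since they contribute no records.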

Ed Morton answered Oct 02 '22 01:10