
bash zcat head causes pipefail?

Tags:

bash

set -eu 
VAR=$(zcat file.gz  |  head -n 12)

works fine

set -eu   -o pipefail
VAR=$(zcat file.gz  |  head -n 12)

causes bash to exit with failure. How is this causing a pipefail?

Note that file.gz contains millions of lines (~ 750 MB, compressed).

cmo asked Jan 06 '17 23:01


2 Answers

Think about it, for a moment.

  1. You're telling the shell that your entire pipeline should be considered to have failed if any component failed.
  2. You're telling zcat to write its output to head.
  3. Then you're telling head to exit after reading 12 lines, out of a much-longer-than-12-line input stream.

Of course you have an error: zcat has its destination pipe closed early, and wasn't able to successfully write a decompressed version of your input file! It has no way of knowing whether this was due to user intent or to something erroneous happening.

If you were using zcat to write to a disk and it ran out of space, or to a network stream and there was a connection loss, it would be entirely correct and appropriate for it to exit with a status indicating a failure. This is simply another case of that rule.


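The specific error which zcat is being given by the operating system is EPIPE, returned by the write syscall under the following condition: "An attempt is made to write to a pipe that is not open for reading by any process." Unless the writer ignores or handles it, the same write also raises SIGPIPE, whose default action terminates the process; on Linux, bash reports that as exit status 141 (128 + 13).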

After head (the only reader of this FIFO) has exited, it would be a bug for any write to the input side of the pipeline not to fail with EPIPE. For zcat to silently ignore an error writing its output, and thus be able to generate an inaccurate output stream without an exit status reflecting this event, would likewise be a bug.
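To see which side of the pipeline is reporting the failure, you can inspect bash's PIPESTATUS array right after the pipeline runs. The following is only a sketch: `yes` is used as a stand-in for zcat on a very large file (that substitution is my assumption, not part of the question), and `set -e` is deliberately left off so the script survives long enough to print the statuses.

```shell
#!/usr/bin/env bash
# Sketch only: 'yes' stands in for zcat on a huge file (an assumption).
set -o pipefail

# The writer produces endless output; the reader stops after one line.
yes | head -n 1 >/dev/null

# PIPESTATUS must be copied immediately, before any other command runs.
statuses=("${PIPESTATUS[@]}")

# With SIGPIPE at its default disposition the writer status is usually 141
# (128 + 13); if SIGPIPE is ignored, the writer sees EPIPE and exits non-zero.
echo "writer status: ${statuses[0]}"
echo "reader status: ${statuses[1]}"
```

The reader (head) succeeds; it is the writer's non-zero status that pipefail promotes to the status of the whole pipeline.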


If you don't want to change any of your shell options, by the way, one workaround you might consider is using process substitution:

var=$(head -n 12 < <(zcat file.gz))

In this case, zcat is not a pipeline component, and its exit status is not considered for purposes of determining success. (You might test whether $var is 12 lines long, if you want to come up with an independent success/fail determination).
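A rough sketch of that independent check, run under the same strict options as in the question. The temporary directory and the generated sample file are hypothetical stand-ins for your real file.gz, and `seq` is only there to produce enough lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample data standing in for the real file.gz.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
seq 1 100000 | gzip > "$tmpdir/file.gz"

# zcat runs inside a process substitution, so its exit status is not
# part of any pipeline and pipefail never sees it.
var=$(head -n 12 < <(zcat "$tmpdir/file.gz"))

# Independent success check: count the lines actually captured.
# (Note: an empty $var would still count as one line here.)
line_count=$(wc -l <<<"$var")

if [ "$line_count" -eq 12 ]; then
    echo "got 12 lines"
else
    echo "expected 12 lines, got $line_count" >&2
fi
```

The trade-off is that a genuine zcat failure (say, a corrupt archive) is only detected indirectly, through the line count.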


A more comprehensive solution could be implemented by pulling in a Python interpreter, with its native gzip support. A native Python implementation (compatible with both Python 2 and 3.x), embedded in a shell script, might look something like:

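zhead_py=$(cat <<'EOF'
import sys, gzip
# argv[1] is the gzip file to read, argv[2] the number of lines to print
gzf = gzip.GzipFile(sys.argv[1], 'rb')
# write bytes: Python 3 needs sys.stdout.buffer, Python 2 uses sys.stdout
outFile = sys.stdout.buffer if hasattr(sys.stdout, 'buffer') else sys.stdout
numLines = 0
maxLines = int(sys.argv[2])
for line in gzf:
    # stop cleanly once enough lines have been written
    if numLines >= maxLines:
        sys.exit(0)
    outFile.write(line)
    numLines += 1
EOF
)
zhead() { python -c "$zhead_py" "$@"; }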

...which gets you a zhead that doesn't fail if it runs out of input data, but does pass through a failed exit status for genuine I/O failures or other unexpected events. (Usage is of the form zhead in.gz 5, to read 5 lines from in.gz).

Charles Duffy answered Oct 11 '22 15:10


Alternatively, you can use

zcat file.gz  | awk '(NR<=12)'

The price is that awk reads the entire output of zcat: the whole file is decompressed, with no early stop after the lines you asked for.
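A small sketch of why the awk filter has to read everything. Here `seq` stands in for zcat (my assumption, just to have a long finite stream), and `set -e` is left off so both statuses can be printed. If awk is told to exit early, the writer is cut off again, just as with head.

```shell
#!/usr/bin/env bash
# Sketch only: 'seq' stands in for zcat on a large file (an assumption).
set -o pipefail

# Filter that reads all input: the writer finishes normally.
seq 1 100000 | awk 'NR<=12' >/dev/null
full=("${PIPESTATUS[@]}")

# Filter that exits after line 12: the writer is cut off mid-stream.
seq 1 100000 | awk 'NR>12 {exit} 1' >/dev/null
early=("${PIPESTATUS[@]}")

echo "read-everything: writer=${full[0]} awk=${full[1]}"
echo "early-exit:      writer=${early[0]} awk=${early[1]}"
```

So the full read is not an accident of this answer; it is exactly what keeps the writer's exit status at zero.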

user12309618 answered Oct 11 '22 14:10