 

In bash, is it generally better to use process substitution or pipelines?

Tags: bash

When the output of a single command is consumed by only one other command, is it better to use | (pipelines) or <() (process substitution)?

Better is, of course, subjective. For my specific use case I am after performance as the primary driver, but also interested in robustness.

I already know about the benefits of while read; do ...; done < <(cmd) and have switched over to it.

I have several var=$(cmd1|cmd2) instances that I suspect might be better replaced as var=$(cmd2 < <(cmd1)).

I would like to know what specific benefits the latter case brings over the former.

Ian asked Jan 28 '18 09:01




1 Answer

tl;dr: Use pipes, unless you have a convincing reason not to.

Piping and redirecting stdin from a process substitution are essentially the same thing: both result in two processes connected by an anonymous pipe.
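
As a quick illustration of that equivalence, the two forms below should capture identical output. This is only a sketch: the printf input and the variable names via_pipe and via_procsub are made up for the example.

```shell
#!/bin/bash
# Hypothetical input: three unsorted numbers, one per line.

# Form 1: producer and consumer connected with a pipeline.
via_pipe=$(printf '3\n1\n2\n' | sort -n)

# Form 2: consumer reads stdin from a process substitution.
via_procsub=$(sort -n < <(printf '3\n1\n2\n'))

if [ "$via_pipe" = "$via_procsub" ]; then
    echo "same output"
else
    echo "different output"
fi
```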

There are three practical differences:

1. Bash defaults to creating a fork for every stage in a pipeline.

Which is why you started looking into this in the first place:

#!/bin/bash
cat "$1" | while IFS= read -r line; do last=$line; done
echo "Last line of $1 is $last"

This script won't work by default with a pipeline because, unlike ksh and zsh, bash forks a subshell for each stage, so the variable set inside the while loop is lost when the pipeline ends.

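If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior by running the last stage in the current shell, and the script works just fine. Note that lastpipe only takes effect when job control is off, which is the default in scripts but not in interactive shells.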
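
A minimal sketch of the lastpipe behavior, assuming bash 4.2+ run as a script (so job control is off). The printf input and the variable names line and last are just for illustration:

```shell
#!/bin/bash
# Run the last pipeline stage in the current shell instead of a subshell.
# Only effective while job control is off, which is the default in scripts.
shopt -s lastpipe

printf 'one\ntwo\nthree\n' | while IFS= read -r line; do
    last=$line    # survives the loop because the loop is not in a subshell
done

echo "Last line is $last"
```

Without the shopt line, last would still be unset after the pipeline, and the echo would print an empty value.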

2. Bash does not wait for process substitutions to finish.

POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.

This makes a notable difference when you have a slow producer, like in a /dev/random password generator:

tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10     # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random)  # Fast?

The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.

Since bash waits for both head and tr to finish, it will appear slower.

In the process substitution version, bash only waits for head, and lets tr finish in the background.
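
The waiting difference can be reproduced without /dev/random. The sketch below swaps in a hypothetical slow_producer function that stalls for two seconds before its second write, and uses bash's SECONDS variable as a coarse timer. The function and variable names are assumptions for the example, not part of the original answer.

```shell
#!/bin/bash
# Hypothetical slow producer: writes one line, then stalls before its next write.
slow_producer() { echo first; sleep 2; echo second; }

# Pipeline: head exits after one line, but bash also waits for slow_producer,
# which only notices the broken pipe when it tries to write "second".
SECONDS=0
slow_producer | head -n 1 > /dev/null
pipe_secs=$SECONDS

# Process substitution: bash waits for head only; slow_producer is left
# to finish in the background.
SECONDS=0
head -n 1 < <(slow_producer) > /dev/null
procsub_secs=$SECONDS

echo "pipeline: ${pipe_secs}s, process substitution: ${procsub_secs}s"
```

The pipeline should take roughly as long as the producer's stall, while the process substitution returns almost immediately. The producer still runs for the same time in both cases; only the point at which bash resumes differs.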

3. Bash does not currently optimize away forks for single simple commands in process substitutions.

If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.

Since forks are expensive, bash optimizes the cases that it can. For example, the command:

bash -c 'sleep 1'

Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.

( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.

$ strace -f bash -c '/bin/true | /bin/true'     2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)'  2>&1 | grep -c clone
3
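
If strace is not available, the exec-without-fork optimization for bash -c can also be observed through /proc. This is a Linux-only sketch that assumes /proc is mounted; the variable names single and followed are made up for the example. It relies on $$ being expanded by the inner bash before it runs cat, so the path names the inner bash's PID.

```shell
#!/bin/bash
# Linux-only sketch: assumes /proc is mounted and cat is an external command.

# Single simple command: bash can replace itself with cat (execve, no fork),
# so the PID that $$ expanded to now belongs to cat.
single=$(bash -c 'cat /proc/$$/comm')

# cat is not the last command here, so bash must fork it and stay around.
followed=$(bash -c 'cat /proc/$$/comm; true')

echo "single command:   $single"
echo "followed by true: $followed"
```

Where the optimization applies, the first line should report cat and the second bash.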

Given the above, you can create a benchmark favoring whichever position you want, but since the number of forks is generally much more relevant, pipes are the best default.

And obviously, it doesn't hurt that pipes are the POSIX standard, canonical way of connecting stdin/stdout of two processes, and work equally well on all platforms.

that other guy answered Oct 14 '22 08:10