When the output of a single command is consumed by only one other command, is it better to use | (pipelines) or <() (process substitution)? Better is, of course, subjective. For my specific use case, performance is the primary driver, but I am also interested in robustness. I already know about the benefits of while read do done < <(cmd) and have switched over to it. I have several var=$(cmd1|cmd2) instances that I suspect might be better written as var=$(cmd2 < <(cmd1)). I would like to know what specific benefits the latter form brings over the former.


In bash, is it generally better to use process substitution or pipelines

1 Answer

tl;dr: Use pipes, unless you have a convincing reason not to.

Piping and redirecting stdin from a process substitution is essentially the same thing: both will result in two processes connected by an anonymous pipe.

There are three practical differences:

1. Bash defaults to creating a fork for every stage in a pipeline.

Which is why you started looking into this in the first place:

#!/bin/bash
# Copy each line into "last"; read overwrites its own variable when it hits EOF.
cat "$1" | while IFS= read -r line; do last=$line; done
echo "Last line of $1 is $last"

This script won't work by default with a pipeline, because unlike ksh and zsh, bash will fork a subshell for each stage, so the variable set inside the loop is lost when that subshell exits.

If you set shopt -s lastpipe in bash 4.2+, bash mimics the ksh and zsh behavior and works just fine.
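As a rough sketch of the difference, the following compares a pipeline under lastpipe with the process substitution form. The variable names (last_pipe, last_procsub) and the three-line input are made up for illustration, and it assumes bash 4.2+ running as a script, since lastpipe only takes effect when job control is off.

```shell
#!/bin/bash
# Assumption: bash 4.2+ with job control off (the default in scripts).
shopt -s lastpipe

# Pipeline: with lastpipe, the final stage runs in the current shell,
# so the assignment inside the loop survives.
last_pipe=""
printf 'one\ntwo\nthree\n' | while IFS= read -r line; do last_pipe=$line; done

# Process substitution: the loop always runs in the current shell.
last_procsub=""
while IFS= read -r line; do last_procsub=$line; done < <(printf 'one\ntwo\nthree\n')

echo "pipeline with lastpipe: $last_pipe"
echo "process substitution:   $last_procsub"
```

Without the shopt line, the first echo would show an empty value, because the loop would have run in a subshell.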

2. Bash does not wait for process substitutions to finish.

POSIX only requires a shell to wait for the last process in a pipeline, but most shells including bash will wait for all of them.

This makes a notable difference when you have a slow producer, like in a /dev/random password generator:

tr -cd 'a-zA-Z0-9' < /dev/random | head -c 10     # Slow?
head -c 10 < <(tr -cd 'a-zA-Z0-9' < /dev/random)  # Fast?

The first example will not benchmark favorably. Once head is satisfied and exits, tr will wait around for its next write() call to discover that the pipe is broken.

Since bash waits for both head and tr to finish, it will appear slower.

In the procsub version, bash only waits for head, and lets tr finish in the background.
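To see the waiting behavior without depending on /dev/random, here is a small sketch with a stand-in producer. The slow_producer function and the two-second sleep are hypothetical; it simply writes one line and then lingers, which is enough to show what bash waits for in each form.

```shell
#!/bin/bash
# Hypothetical producer: emits one line, then lingers for two seconds.
slow_producer() { echo data; sleep 2; }

# Pipeline: bash waits for every stage, including the lingering producer.
SECONDS=0
slow_producer | head -n 1 >/dev/null
pipe_elapsed=$SECONDS

# Process substitution: bash waits only for head and leaves the
# producer to finish in the background.
SECONDS=0
head -n 1 < <(slow_producer) >/dev/null
procsub_elapsed=$SECONDS

echo "pipeline: ${pipe_elapsed}s, process substitution: ${procsub_elapsed}s"
```

Both forms do the same amount of work; the pipeline merely returns control later because bash waits for the producer too.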

3. Bash does not currently optimize away forks for single simple commands in process substitutions.

If you invoke an external command like sleep 1, then the Unix process model requires that bash forks and executes the command.

Since forks are expensive, bash optimizes the cases that it can. For example, the command:

bash -c 'sleep 1'

Would naively incur two forks: one to run bash, and one to run sleep. However, bash can optimize it because there's no need for bash to stay around after sleep finishes, so it can instead just replace itself with sleep (execve with no fork). This is very similar to tail call optimization.

( sleep 1 ) is similarly optimized, but <( sleep 1 ) is not. The source code does not offer a particular reason why, so it may just not have come up.

$ strace -f bash -c '/bin/true | /bin/true'     2>&1 | grep -c clone
2
$ strace -f bash -c '/bin/true < <(/bin/true)'  2>&1 | grep -c clone
3

Given the above you can create a benchmark favoring whichever position you want, but since the number of forks is generally much more relevant, pipes would be the best default.

And obviously, it doesn't hurt that pipes are the POSIX standard, canonical way of connecting stdin/stdout of two processes, and work equally well on all platforms.
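For the var=$(cmd1|cmd2) case from the question, a minimal sketch shows that both spellings capture the same value. The commands here (printf feeding grep -c) and the variable names are placeholders chosen for illustration, not taken from the question.

```shell
#!/bin/bash
# Placeholder commands: count the non-empty lines produced by printf.
count_pipe=$(printf 'a\nb\nc\n' | grep -c .)
count_procsub=$(grep -c . < <(printf 'a\nb\nc\n'))

echo "pipe: $count_pipe, process substitution: $count_procsub"
```

Since the captured result is the same either way, the choice comes down to the fork count and portability points above.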

answered Oct 14 '22 08:10

that other guy


Tags:

bash

Ian
