
Why avoid subshells?

Tags:

bash

subshell

I've seen a lot of answers and comments on Stack Overflow that mention doing something to avoid a subshell. In some cases, a functional reason is given (most often, the need to read, outside the subshell, a variable that was assigned inside it), but in other cases the avoidance seems to be treated as an end in itself. For example:

  • union of two columns of a tsv file
    suggesting { ... ; } | ... rather than ( ... ) | ..., so there's a subshell either way.

  • unhide hidden files in unix with sed and mv commands

  • Linux bash script to copy files
    explicitly stating, "the goal is just to avoid a subshell"

Why is this? Is it for style/elegance/beauty? For performance (avoiding a fork)? For preventing likely bugs? Something else?

asked Feb 24 '14 by ruakh



2 Answers

There are a few things going on.

First, forking a subshell might be unnoticeable when it happens only once, but if you do it in a loop, it adds up to a measurable performance impact. The impact is also greater on platforms such as Windows, where forking is not as cheap as it is on modern Unix-likes.

Second, forking a subshell means you have more than one context, and information is lost in switching between them -- if you change your code to set a variable in a subshell, that variable is lost when the subshell exits. Thus, the more your code has subshells in it, the more careful you have to be when modifying it later to be sure that any state changes you make will actually persist.
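A minimal illustration of that state loss: the classic case is a pipeline feeding a while loop, since the loop body runs in a subshell. The here-string workaround shown afterward is one common fix, not the only one.

```shell
#!/usr/bin/env bash
count=0
echo "one two three" | while read -r _; do
  count=$((count+1))          # runs in a subshell created by the pipeline
done
echo "after pipeline: $count"   # still 0 -- the increment was lost

# Workaround: feed the loop without a pipe, so it runs in the current shell.
count=0
while read -r _; do
  count=$((count+1))
done <<< "one two three"
echo "after here-string: $count"  # 1 -- the change persisted
```

(In bash, `shopt -s lastpipe` in a non-interactive shell is another way to keep the last pipeline stage in the current shell.)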

See BashFAQ #24 for some examples of surprising behavior caused by subshells.

answered Nov 16 '22 by Charles Duffy


Sometimes examples are helpful.

f='fred';y=0;time for ((i=0;i<1000;i++));do if [[ -n "$( grep 're' <<< $f )" ]];then ((y++));fi;done;echo $y

real    0m3.878s
user    0m0.794s
sys 0m2.346s
1000

f='fred';y=0;time for ((i=0;i<1000;i++));do if [[ -z "${f/*re*/}" ]];then ((y++));fi;done;echo $y

real    0m0.041s
user    0m0.027s
sys 0m0.001s
1000

f='fred';y=0;time for ((i=0;i<1000;i++));do if grep -q 're' <<< $f ;then ((y++));fi;done >/dev/null;echo $y

real    0m2.709s
user    0m0.661s
sys 0m1.731s
1000

As you can see, in this case, the difference between using grep in a subshell and parameter expansion to do the same basic test is close to 100x in overall time.
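For a like-for-like comparison outside the timing loop, here is a sketch of the same substring test done three ways; the glob and regex forms are bash builtins and fork nothing at all:

```shell
#!/usr/bin/env bash
f='fred'

# Subshell version: forks grep once per test.
if [[ -n "$(grep 're' <<< "$f")" ]]; then sub=yes; else sub=no; fi

# Builtin glob match: no fork.
if [[ $f == *re* ]]; then glob=yes; else glob=no; fi

# Builtin regex match: also no fork.
if [[ $f =~ re ]]; then regex=yes; else regex=no; fi

echo "$sub $glob $regex"   # yes yes yes -- all three agree on the result
```

The three tests are equivalent for this input; only the first pays for a fork and an exec on every iteration.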

Following up on the question, and prompted by the discussion in the comments, I also checked the example from this post: https://unix.stackexchange.com/questions/284268/what-is-the-overhead-of-using-subshells

time for((i=0;i<10000;i++)); do echo "$(echo hello)"; done >/dev/null 
real    0m12.375s
user    0m1.048s
sys 0m2.822s

time for((i=0;i<10000;i++)); do echo hello; done >/dev/null 
real    0m0.174s
user    0m0.165s
sys 0m0.004s

This is actually far worse than I expected: almost two orders of magnitude slower in overall time, and almost three orders of magnitude slower in system-call time, which is absolutely incredible. Note that echo is a bash builtin (https://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html), so the entire difference comes from the fork for the command substitution.

The point of demonstrating this is that subshell-based tests using grep, sed, or gawk (or even a builtin like echo wrapped in a command substitution) are an easy habit to fall into when hacking quickly; for me it's a bad habit. It's worth realizing that this carries a significant performance hit, and it's probably worth avoiding these constructs whenever bash builtins can handle the job natively.
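As a sketch of what "builtins handling the job natively" can look like, two common subshell habits, `dirname` and `basename` in command substitutions, have parameter-expansion equivalents (illustrative example; the variable names are mine):

```shell
#!/usr/bin/env bash
path='/usr/local/bin/bash'

# Habitual subshell versions (each forks an external command):
#   dir=$(dirname "$path");  base=$(basename "$path")

# Builtin parameter-expansion equivalents (no fork):
dir=${path%/*}     # strip shortest trailing /component  -> like dirname
base=${path##*/}   # strip longest leading  */           -> like basename

echo "$dir $base"  # /usr/local/bin bash
```

(The expansions don't handle every edge case `dirname`/`basename` do, such as a bare `/` or a path with no slash, so they are a drop-in only when the input shape is known.)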

By carefully reviewing a large program's use of subshells and replacing them with other methods where possible, I was able to cut about 10% off the overall execution time in a just-completed round of optimizations. (This is not the first or the last time I have done this; the program has already been optimized several times, so gaining another 10% is actually quite significant.)

So it's worth being aware of.

Because I was curious, I wanted to confirm what 'time' is telling us here: https://en.wikipedia.org/wiki/Time_(Unix)

The total CPU time is the combination of the amount of time the CPU or CPUs spent performing some action for a program and the amount of time they spent performing system calls for the kernel on the program's behalf. When a program loops through an array, it is accumulating user CPU time. Conversely, when a program executes a system call such as exec or fork, it is accumulating system CPU time.

As you can see, particularly in the echo loop test, the cost of the forks is very high in terms of system calls to the kernel; those forks really add up (roughly 700x more time spent on sys calls).

I'm in an ongoing process of resolving some of these issues, so these questions are quite relevant to me and to the global community of users of the program in question. In other words, this is not an arcane academic point for me; it's real-world, with real impacts.

answered Nov 16 '22 by Lizardx