
How does buffer in bash pipe work on linux? [duplicate]

Tags:

linux

bash

pipe

Consider a simple command like the following:

cmd1 | cmd2

Does cmd2 start to execute

  1. as soon as cmd1 outputs something
  2. or only if cmd1 completely finishes and exits?

In case 1, when cmd1 produces output faster than cmd2 consumes it, or simply in case 2, there has to be a buffer for the intermediate output.

  1. Where is that buffer located? Is it in memory or on disk?
  2. Is it possible to configure the buffer's location and size?
  3. What would happen when the buffer is not big enough?
Fermat's Little Student asked Jul 21 '18 02:07



1 Answer

The cmd2 program starts to run immediately, but whenever it tries to read input, it'll "block" (stop and wait) if necessary until some is available. This is done automatically by the kernel. Other than that, the two programs can run concurrently (including at the same time on different CPU cores).
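The blocking read can be observed directly. The following is a minimal sketch in Python, not what bash itself does: it assumes a POSIX system, uses a thread in place of the cmd2 process (the pipe behaves the same way), and the function name demo_blocking_read is just an illustrative choice.

```python
import os
import threading
import time

def demo_blocking_read(delay=0.3):
    """Start a reader before any data exists and record when things happen."""
    r, w = os.pipe()
    times = {}

    def reader():
        # The reader (standing in for cmd2) starts right away...
        times["reader_start"] = time.monotonic()
        # ...but this call blocks in the kernel until data is available.
        times["data"] = os.read(r, 1024)
        times["read_done"] = time.monotonic()

    t = threading.Thread(target=reader)
    t.start()

    time.sleep(delay)                 # a slow "cmd1": nothing written yet
    times["write_time"] = time.monotonic()
    os.write(w, b"hello\n")

    t.join()
    os.close(r)
    os.close(w)
    return times

times = demo_blocking_read()
print("reader started before the write:", times["reader_start"] < times["write_time"])
print("read returned after the write:  ", times["write_time"] <= times["read_done"])
```

The reader records its start time well before the write happens, yet its read only returns once the writer has put something into the pipe.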

The buffer between the two processes is held by the kernel, and it's in memory (though it might be possible for it to be paged out; I'm not sure). The default size of the buffer (16 pages on Linux, i.e. 64 KiB with 4 KiB pages) doesn't seem to be configurable, but a program can request a bigger size for a specific pipe with fcntl(fd, F_SETPIPE_SZ, size), and the limit on that for unprivileged processes is configurable by writing to the /proc/sys/fs/pipe-max-size file (which, being in /proc, isn't actually a file on disk; it's a virtual file that accesses a setting in the kernel). See this question for more information.
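For illustration, here is a rough Python sketch of querying and enlarging one pipe's buffer. It assumes Linux; the numeric fallbacks for F_GETPIPE_SZ and F_SETPIPE_SZ are the values from the kernel headers for Python versions that don't export the names, and pipe_sizes and the 256 KiB request are arbitrary choices for the example.

```python
import fcntl
import os

# Linux-specific fcntl commands; fall back to the values from <linux/fcntl.h>
# when the running Python version does not export these names.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

def pipe_sizes(requested=256 * 1024):
    """Return (default size, size after resizing, system limit or None)."""
    r, w = os.pipe()
    try:
        default_size = fcntl.fcntl(w, F_GETPIPE_SZ)

        # The limit for unprivileged processes, exposed as a virtual file.
        try:
            with open("/proc/sys/fs/pipe-max-size") as f:
                max_size = int(f.read())
        except OSError:
            max_size = None

        # Stay within the limit; the kernel may round the size up,
        # and the call returns the capacity that was actually set.
        target = requested if max_size is None else min(requested, max_size)
        new_size = fcntl.fcntl(w, F_SETPIPE_SZ, target)
        return default_size, new_size, max_size
    finally:
        os.close(r)
        os.close(w)

default_size, new_size, max_size = pipe_sizes()
print("default:", default_size, "after resize:", new_size, "limit:", max_size)
```

Note that this changes only the one pipe the program created itself; a pipe that bash sets up for `cmd1 | cmd2` keeps the default size unless cmd1 or cmd2 makes the same fcntl call on its own stdout or stdin.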

If cmd1 tries to write but the buffer is full, it will block until some space becomes available in the buffer (which happens when cmd2 reads some of the buffered data). So if cmd1 is producing output too fast, it'll automatically be slowed down by having to wait for cmd2 to consume the output.
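One way to see the "buffer full" condition without hanging is to put the write end in non-blocking mode, so the kernel reports an error at exactly the point where a normal writer would be put to sleep. This is only a sketch, assuming a POSIX system; fill_pipe is a hypothetical helper name.

```python
import os

def fill_pipe():
    """Fill a pipe until it is full, then show that reading frees space."""
    r, w = os.pipe()
    os.set_blocking(w, False)      # a full buffer now raises instead of blocking
    chunk = b"x" * 4096
    written = 0
    try:
        while True:
            written += os.write(w, chunk)
    except BlockingIOError:
        pass                       # buffer full: a blocking cmd1 would wait here

    os.read(r, 8192)               # "cmd2" consumes some of the buffered data...
    more = os.write(w, chunk)      # ...so the writer can make progress again

    os.close(r)
    os.close(w)
    return written, more

capacity, more = fill_pipe()
print("bytes accepted before the pipe was full:", capacity)
print("bytes accepted after the reader drained some:", more)
```

The first number is effectively the pipe's capacity on the machine where this runs.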

If the buffer is small, the programs may end up blocking more frequently while waiting on it, which can make them take longer to finish because they'll be spending more time waiting.

In general, there are two categories that most pipelines are likely to fall into:

  • cmd1 produces output faster than cmd2 consumes it: the buffer is usually full (or close to it) and cmd1 often blocks when trying to write, which slows it down to match the speed of cmd2. cmd2 is able to run at full speed because input is always available in the buffer, so it rarely has to block on reading.
  • cmd2 consumes input faster than cmd1 produces it: the buffer is usually empty (or close to it), and cmd2 often blocks when trying to read, which slows it down to match the speed of cmd1. cmd1 is able to run at full speed because there's always space available for writing to the buffer, so it rarely has to block on writing.
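The first category can be reproduced in a small experiment. The sketch below is an assumption-laden illustration rather than a benchmark: it uses a thread as the slow cmd2, an artificial 10 ms pause per read, and made-up names (throttled_pipeline, slow_consumer). The producer does no real work, so any time it takes is time spent blocked on a full pipe.

```python
import os
import threading
import time

def throttled_pipeline(chunks=32, chunk_size=64 * 1024, pause=0.01):
    """Push data through a pipe whose reader is deliberately slow."""
    r, w = os.pipe()
    result = {"received": 0}

    def slow_consumer():
        while True:
            data = os.read(r, chunk_size)
            if not data:           # empty read: the writer closed its end
                break
            result["received"] += len(data)
            time.sleep(pause)      # pretend each chunk takes a while to process
        os.close(r)

    t = threading.Thread(target=slow_consumer)
    t.start()

    payload = b"x" * chunk_size
    start = time.monotonic()
    for _ in range(chunks):
        view = memoryview(payload)
        while view:                # write() may accept only part of the data
            n = os.write(w, view)  # blocks whenever the pipe buffer is full
            view = view[n:]
    os.close(w)                    # signals end of input to the consumer
    result["producer_seconds"] = time.monotonic() - start

    t.join()
    return result

result = throttled_pipeline()
print("bytes received:", result["received"])
print("producer spent %.2f s, almost all of it blocked" % result["producer_seconds"])
```

Swapping the pause to the producer side (sleeping between writes and reading without delay) would give the second category, where the reader is the one that waits.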
Wyzard answered Nov 15 '22 05:11
