
Does "cmd | getline" have a limit for the passed data?

Stumbled across a certain limit (?) when passing data through "cmd | getline" (with both macOS awk 20200816 and GNU Awk 5.3.1).

Reproducible examples:

file.txt is a text file filled with "lorem ipsum" plus one <title>test</title> string for testing purposes, with no newlines. Another program can be used instead of xmllint, for example grep -o \047^[^ ]*\047; the only difference is that with grep the largest "working" size is 30 bytes larger, 1 044 977 bytes.

gawk '{ cmd = ("xmllint --html --xpath \047//title\047 - 2>/dev/null <<< \047"$0"\047")
         cmd | getline word
         print word
         close(cmd)
       }' < file.txt

This prints the "word" value (i.e., <title>test</title>) if the file contains 1 044 947 characters or fewer; with more, it prints nothing.

gawk '{ cmd = ("xmllint --html --xpath \047//title\047 - 2>/dev/null")
         print $0 | cmd
       }' < file-x2.txt

This prints <title>test</title> even if the file's content is twice as big. That would be fine, but I need the result held in a variable, not printed to the terminal straight away.

What is this limit? Is it imposed by gawk, shell, OS?

jack asked Oct 29 '25 18:10


1 Answer

The limit does not come from the |, but from passing your whole $0 on cmd's command line rather than through its stdin. (You did this on purpose, as a workaround for awk offering either print $0 | cmd or cmd | getline word, but no double pipe such as print $0 | cmd | getline word.)

Your OS theoretically defines a limit on the total length of the arguments that you can pass to a command (you can see this limit with getconf ARG_MAX, which should show you this 1 MB limit);
however, the shells (bash on my FreeBSD, but probably zsh too on your macOS) do not impose this restriction when the <<< operator turns data written on the command line into stdin for the forked process; you can verify this by running xmllint --html --xpath '//title' - 2>/dev/null <<< "`cat file.txt`": you won't encounter this limit.

However, before reaching the shell (where this special rule of "the string after <<< looks like an argument but is in fact stdin" is implemented),
awk has to pass this string uninterpreted through popen(), which hands it to sh -c as a single argument; that is where ARG_MAX is enforced.

Instead of truncating a command that exceeds that limit, popen() drops it entirely, even though it returns a valid FILE * (so awk cannot detect that there was a problem). getline then reads nothing, as if a 0-length input had been passed to your xmllint.
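The two paths described above can be compared directly from a shell. This is only a sketch under assumptions: the 3.3 MB "lorem-ipsum" string is made-up test data chosen to exceed the argument limits of common systems, and the builtin printf stands in for the <<< operator (in both cases the data never becomes an argument of an exec'ed program).

```shell
#!/bin/sh
# Build a 3.3 MB string (hypothetical test data).
big=$(awk 'BEGIN { for (i = 0; i < 300000; i++) printf "lorem-ipsum" }')

# Path 1: the data stays inside the shell (printf is a builtin) and reaches
# the child process on stdin, so no argument limit applies.
via_stdin=$(printf '%s\n' "$big" | wc -c | tr -d ' ')

# Path 2: the data is an argument of an exec'ed program, which is what
# popen() does when it runs sh -c with the whole command string.
if sh -c 'exit 0' sh "$big" 2>/dev/null; then
    via_arg="ok"
else
    via_arg="failed"
fi

echo "stdin: $via_stdin bytes, argument: $via_arg"
```

On a system where 3.3 MB is above the limit, the second path is expected to fail with "Argument list too long" while the first one counts every byte.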

A solution?

To circumvent this limit, I'd say you could print $0 to a temporary file (or a named pipe), then run xmllint on that file or pipe.
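A minimal sketch of the temporary-file approach, assuming a POSIX shell and awk. The generated test data, the mktemp directory, the record.tmp file name, and the use of grep -o in place of xmllint are illustrative assumptions, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical test data: one long line (about 2.2 MB) with a single <title>.
tmpdir=$(mktemp -d)
input="$tmpdir/file.txt"
awk 'BEGIN {
    printf "<title>test</title>"
    for (i = 0; i < 200000; i++) printf "lorem-ipsum"
    print ""
}' > "$input"

result=$(awk -v tmp="$tmpdir/record.tmp" '{
    # Write the record to a temporary file instead of embedding it in cmd.
    print $0 > tmp
    close(tmp)
    # The command line stays short; the data is read from the file.
    cmd = "grep -o \047<title>[^<]*</title>\047 < \047" tmp "\047"
    word = ""
    if ((cmd | getline word) > 0)
        print word
    close(cmd)
}' "$input")

echo "$result"
rm -rf "$tmpdir"
```

The value ends up in the awk variable word, which is what the question asks for. With xmllint, the command string would presumably become the original xmllint invocation followed by the same redirection from the temporary file.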

Testing popen()'s limits

You can compile a small C program to test invoking an echo through popen(), to emulate awk's internal workings:

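#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    size_t bs, ts;
    int n;
    char * cmd;
    char buf[0x10000];
    long argmax = sysconf(_SC_ARG_MAX);
    fprintf(stdout, "ARG_MAX = %ld\n", argmax);
    /* Start slightly above ARG_MAX, then shorten the command by one byte per iteration. */
    argmax += 10;
    cmd = malloc(argmax);
    if(!cmd)
        return 1;
    strcpy(cmd, "echo ");
    memset(&cmd[strlen(cmd)], 'g', argmax - strlen(cmd) - 1);
    for(n = 0; ++n < 2000;)
    {
        cmd[argmax - n] = 0;
        FILE * f = popen(cmd, "r");
        if(!f)
        {
            fprintf(stdout, "%zu bytes cmd → popen failed\n", strlen(cmd));
            continue;
        }
        /* Count the bytes that echo wrote back. */
        ts = 0;
        while((bs = fread(buf, 1, sizeof(buf), f)) > 0)
            ts += bs;
        fprintf(stdout, "%zu bytes cmd → %zu bytes echo (popen returned %p)\n", strlen(cmd), ts, (void *)f);
        pclose(f); /* a popen()ed stream must be closed with pclose(), not fclose() */
    }
    free(cmd);
    return 0;
}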

On my FreeBSD 14.3 system, I get:

ARG_MAX = 524288
524297 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
524296 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
…
523013 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523012 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523011 bytes cmd → 0 bytes echo (popen returned 0x8225f9c40)
523010 bytes cmd → 523006 bytes echo (popen returned 0x8225f9c40)
523009 bytes cmd → 523005 bytes echo (popen returned 0x8225f9c40)
523008 bytes cmd → 523004 bytes echo (popen returned 0x8225f9c40)
…

so I can pass up to ARG_MAX - 1278 bytes to my command.
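POSIX defines ARG_MAX as covering the arguments and the environment together, so part of that gap is likely taken by the environment. The sketch below gives a rough estimate only: env | wc -c approximates the size of the environment strings, and any per-pointer or per-string overhead the kernel may count is ignored.

```shell
#!/bin/sh
# Limit reported by the system for arguments plus environment.
arg_max=$(getconf ARG_MAX)

# Approximate size of the current environment ("NAME=value" plus one byte each).
env_bytes=$(env | wc -c | tr -d ' ')

echo "ARG_MAX = $arg_max"
echo "environment uses about $env_bytes bytes"
echo "roughly $((arg_max - env_bytes)) bytes remain for arguments"
```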

Guillaume Outters answered Nov 01 '25 11:11